Sleep Quality Prediction - Supervised Machine Learning¶


by: Cody Hill

date: 1/18/2024

This project is intended as a demonstration and exploration of training machine learning models to predict sleep quality from exported Oura Ring user data. The ground truth target variable is the sleep score assigned to each day in the export. Along with the goal of creating a well-fit model for this problem, both regression and classification models will be trained, exploring the efficacy of converting the discrete sleep scores into ordinal categorical variables.
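For the classification side, the discrete 0-100 scores can be binned into ordinal categories with pandas. A minimal sketch with illustrative scores and bin edges (not the ones used later in the notebook):

```python
import pandas as pd

# Illustrative sleep scores; the real targets come from the Oura export.
scores = pd.Series([62, 71, 78, 84, 91])

# Even-width-style bins: pd.cut splits the score range at fixed edges.
even_bins = pd.cut(scores, bins=[0, 70, 80, 100], labels=['low', 'medium', 'high'])

# Quantile bins: pd.qcut splits so each bin holds roughly equal counts.
quantile_bins = pd.qcut(scores, q=3, labels=['low', 'medium', 'high'])

print(even_bins.tolist())      # ['low', 'medium', 'medium', 'high', 'high']
print(quantile_bins.tolist())  # ['low', 'low', 'medium', 'high', 'high']
```

Both produce ordered categoricals, which is what lets a classifier treat the bins as ordinal labels.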

Since the data collection period is currently limited, this project has been made with the goal of handling continued data uploads (all data cleaning/analysis will be generalized for continuous new data).

This project is licensed under the terms of the MIT License, but is intended for private use only.

If you fork or use any part of this project please attribute Cody Hill as the creator of this work.

Github Repo: https://github.com/chill0121/Supervised-Sleep

Table of Contents ¶


  • 1.Data Source Information
  • 2.Setup
    • 2.1. Environment Information for Reproducibility
    • 2.2. Importing the Data
  • 3.Data Cleaning and Processing
    • 3.1. First Looks
    • 3.2. Remove Oura's Score Aggregates
    • 3.3. Removing Irrelevant Features
    • 3.4. Data Munging
    • 3.5. Feature Engineering and Processing
      • 3.5.1. Did You Take a Nap Today?
      • 3.5.2. Aligning the Dataframes - Why workout today when there's always tomorrow?
      • 3.5.3. Removing Days Without Sleep - No rest for the predicted
      • 3.5.4. Binning a Target Variable?!
        • 3.5.4.1 Even-Width Bins
        • 3.5.4.2 Quantile/Median-Centered Bins
        • 3.5.4.3 Custom Quantile Bins
      • 3.5.5. Aggregating the Sleep Metrics
      • 3.5.6. Joining the Dataframes
      • 3.5.7. Dealing with Missing Data - NaNs
  • 4.Exploratory Data Analysis (EDA)
    • 4.1. High-Level Summary
    • 4.2. EDA Helper Functions
    • 4.3. Visualizing the Sleep Metrics
    • 4.4. Remove Day Column
    • 4.5. Correlation Matrix
    • 4.6. Analyzing Bedtime Delta - The moving target of sleep
      • 4.6.1. Hypothesis Testing
        • 4.6.1.1 Two-sample t-test Hypothesis
        • 4.6.1.2 Checking for Normality
        • 4.6.1.3 Checking for Variance Equivalence
        • 4.6.1.4 Performing a Two-sample t-test
    • 4.7. OLS Multiple Linear Regression as a Diagnostic Tool - Pre-Feature Removal
  • 5.Feature Selection/Removal
  • 6.Outlier Removal
  • 7.Train Test Split
    • 7.1. Standardize Variables - Created separately on the same split
  • 8.Supervised Learning Model Selection
    • 8.1. Model Baseline Evaluation Metrics
      • 8.1.1. Classification Baseline
      • 8.1.2. Regression Baseline
    • 8.2. Model Helper Functions
    • 8.3. Model Initializations
      • 8.3.1. Linear/Logistic Regression
      • 8.3.2. Support Vector Machine (SVM)
      • 8.3.3. Decision Trees
        • 8.3.3.1 AdaBoost
  • 9.Results
    • 9.1. Results Helper Functions
    • 9.2. Results - Evaluation Metrics
      • 9.2.1. Regression Models
      • 9.2.2. Classification Models
  • 10.Conclusions

Use these links to navigate through the notebook more easily.

Note: A link will be included at the bottom of each numbered section back to the table of contents.

1. Data Source Information ¶


All data has been exported from my personal Oura Ring and contains raw biometric data and Oura-calculated metrics since I began wearing the device.

The Oura Ring tracks and records over 20 biometric signals from the sensors on the inside of the ring, throughout the day and during sleep. Along with the raw biometric data, Oura's software derives new metrics that feed into a daily score for categories such as sleep, recovery, readiness, and activity. In total, 89 features can be extracted from a user's account, giving historical data since the beginning of the user's wear time. Most of these features are daily cumulative sums of metrics or other measures with one entry per day, but the sleep data has an entry for every sleep event, potentially allowing multiple entries per day.

Notable Data Information:

  • Data collection from 2/3/2023 to 2/6/2024 (last upload)
  • Oura Ring Gen. 3 | Firmware: 2.9.32
  • 89 Total Features
  • 369 rows of biometric data, which equates to 369 days of data.
  • 730 rows of sleep data, which equates to that number of sleep events recorded.

  • Feature Information (more info can be found at https://cloud.ouraring.com/edu/sleep_score):

    • Sleep Score: (score) Ranging from 0-100, the sleep score is an overall measure of how well you slept.
    • Awake Time: (awake_time) Awake time is the time spent awake in bed before and after falling asleep.
    • Bedtime: (bedtime_start_delta) Bedtime is an estimate of the time you went to bed with the intention to sleep. Delta measures the difference of your bedtime compared to your regular bedtime (calculates continuously).
    • Deep Sleep Time: (deep_sleep_duration) Deep sleep is the most restorative and rejuvenating sleep stage, enabling muscle growth and repair. The amount of deep sleep can vary significantly between nights and individuals. It can make up anywhere between 0-35% of your total sleep time.
    • Light Sleep Time: (light_sleep_duration) Light sleep makes up about 50% of total sleep time for adults, and typically begins a sleep cycle.
    • REM Sleep Time: (rem_sleep_duration) REM (rapid eye movement) sleep is the final sleep stage in a typical sleep cycle. It’s associated with dreaming, memory consolidation, learning and creativity.
    • Total Sleep Time: (total_sleep_duration) Total sleep time refers to the total amount of time you spend in light, REM, and deep sleep.
    • Respiratory Rate: (average_breath) Oura tracks the number of breaths you take per minute while you sleep, and shows your nocturnal average respiratory rate.
    • Sleep Latency: (latency) Sleep latency is the time it takes for you to fall asleep.
    • Average HRV: (average_hrv) When a person is relaxed, a healthy heart’s beating rate shows variation in the time interval between heartbeats.
    • Body Temperature: (temperature_deviation) Oura measures your body temperature while you sleep. It sets the baseline for your normal temperature during the first couple of weeks, and adjusts it if needed as more data is collected. Variations are shown in relation to your baseline, represented by 0.0.
    • Activity Burn: (active_calories) Activity burn shows the kilocalories you've burned by daily movement and exercise.
    • Low Activity: (low_activity_time) Low activity includes activities such as casual walking and light housework both indoors and outdoors.
    • Medium Activity: (medium_activity_time) Medium activity includes dynamic activities with an intensity level equivalent to brisk walking.
    • High Activity: (high_activity_time) High activity includes vigorous activities with an intensity level higher or equivalent to jogging.
    • Inactive Time: (sedentary_time) Inactive time includes sitting, standing or otherwise being passive.
Back to Table of Contents¶

2. Setup ¶


In [1]:
import os
import sys
import numpy as np
import pandas as pd
import sklearn
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor, AdaBoostClassifier
from sklearn.svm import SVC, SVR
from sklearn.model_selection import train_test_split, StratifiedShuffleSplit, cross_val_score, GridSearchCV
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay, roc_auc_score, roc_curve, RocCurveDisplay, mean_squared_error, mean_absolute_error, auc, f1_score
from sklearn.preprocessing import StandardScaler, label_binarize, LabelBinarizer
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import LinearRegression, Ridge, Lasso, RidgeClassifier
import scipy
import scipy.stats as stats
import statsmodels.api as sm
import matplotlib
import matplotlib.pyplot as plt
from matplotlib.colors import Normalize
import seaborn as sns

2.1. Environment Information for Reproducibility: ¶

In [2]:
print(f"Python version: {sys.version}")

packages = [np, pd, sklearn, scipy, sm, matplotlib, sns]
for package in packages:
    print(f"{str(package).partition('from')[0]} using version: {package.__version__}")
Python version: 3.11.7 (main, Dec  4 2023, 18:10:11) [Clang 15.0.0 (clang-1500.1.0.2.5)]
<module 'numpy'  using version: 1.26.3
<module 'pandas'  using version: 2.1.4
<module 'sklearn'  using version: 1.3.2
<module 'scipy'  using version: 1.11.4
<module 'statsmodels.api'  using version: 0.14.1
<module 'matplotlib'  using version: 3.8.2
<module 'seaborn'  using version: 0.13.2
Back to Table of Contents¶

2.2. Importing the Data: ¶

As discussed above, we will import the data into three separate dataframes:

  1. biometric_df for the biometric data, which has one entry per day.
  2. sleep_df for all recorded sleep events, with potentially multiple entries per day.
  3. daily_sleep_score_df for the daily sleep score, which will be used as the ground truth target variable.
In [3]:
# Import data
current_wdir = os.getcwd()
data_folder = current_wdir + '/Data'
sleep_data_folder = current_wdir + '/Data/Sleep_Data'

# Iterate through each file in ./Data and add it to a dataframe.
file_path = [f'{data_folder}/{file}' for file in os.listdir(data_folder) if '.csv' in file]
# Using a list to concat the dfs with index_col allows us to easily merge based on 'day'. More memory usage, but fine for this project.
biometric_df = pd.concat([pd.read_csv(file, index_col = 'day') for file in file_path], join = 'outer', ignore_index = False, axis = 1).reset_index()

# Iterate through each file in ./Data/Sleep_Data and add it to a dataframe.
# Sleep data is separated as it potentially has multiple entries per day. Will merge them later.
file_path_sleep = [f'{sleep_data_folder}/{file}' for file in os.listdir(sleep_data_folder) if '.csv' in file and 'daily' not in file]
sleep_df = pd.concat(map(pd.read_csv, file_path_sleep), join = 'outer', ignore_index = False, axis = 1)

# Import ground truth label sleep score.
file_path_daily_sleep_score = [f'{sleep_data_folder}/{file}' for file in os.listdir(sleep_data_folder) if '.csv' in file and 'daily' in file]
daily_sleep_score_df = pd.read_csv(file_path_daily_sleep_score[0])
daily_sleep_score_df = daily_sleep_score_df[['score', 'day']]
Back to Table of Contents¶

3. Data Cleaning and Processing ¶


3.1. First Looks: ¶

Let's take a look at what data types the columns have been assigned.

In [4]:
biometric_df.dtypes.value_counts()
Out[4]:
int64      25
float64    13
object      4
Name: count, dtype: int64
In [5]:
sleep_df.dtypes.value_counts()
Out[5]:
float64    33
object      9
int64       5
Name: count, dtype: int64

The majority of these 89 features are numerical, with 13 showing up as objects, which are typically strings.

Now display the dataframes to get a better idea of what these features are. Since 89 columns is too many to show at once, print out the column names as well.

In [6]:
# First detailed looks
print('biometric_df info:\n------------------\n', biometric_df.shape)
display(biometric_df.head(2))
print(biometric_df.columns)
print('\n\nsleep_df info:\n------------------\n', sleep_df.shape)
display(sleep_df.head(4))
print(sleep_df.columns)
biometric_df info:
------------------
 (369, 42)
day active_calories average_met_minutes equivalent_walking_distance high_activity_met_minutes high_activity_time inactivity_alerts low_activity_met_minutes low_activity_time medium_activity_met_minutes ... temperature_trend_deviation contributors_activity_balance contributors_hrv_balance contributors_previous_day_activity contributors_previous_night contributors_recovery_index contributors_resting_heart_rate contributors_sleep_balance contributors_body_temperature spo2_percentage
0 2023-02-03 140 1.18750 1969 0 0 1 86 8460 8 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 2023-02-04 650 1.59375 10728 16 120 0 277 23220 172 ... NaN NaN NaN 96.0 74.0 97.0 94.0 NaN 90.0 98.523

2 rows × 42 columns

Index(['day', 'active_calories', 'average_met_minutes',
       'equivalent_walking_distance', 'high_activity_met_minutes',
       'high_activity_time', 'inactivity_alerts', 'low_activity_met_minutes',
       'low_activity_time', 'medium_activity_met_minutes',
       'medium_activity_time', 'meters_to_target', 'non_wear_time',
       'resting_time', 'sedentary_met_minutes', 'sedentary_time', 'steps',
       'target_calories', 'target_meters', 'total_calories', 'score',
       'class_5_min', 'contributors_meet_daily_targets',
       'contributors_move_every_hour', 'contributors_recovery_time',
       'contributors_stay_active', 'contributors_training_frequency',
       'contributors_training_volume', 'met_1_min', 'ring_met_1_min', 'score',
       'temperature_deviation', 'temperature_trend_deviation',
       'contributors_activity_balance', 'contributors_hrv_balance',
       'contributors_previous_day_activity', 'contributors_previous_night',
       'contributors_recovery_index', 'contributors_resting_heart_rate',
       'contributors_sleep_balance', 'contributors_body_temperature',
       'spo2_percentage'],
      dtype='object')


sleep_df info:
------------------
 (730, 47)
average_breath average_heart_rate average_hrv awake_time bedtime_end bedtime_start day deep_sleep_duration efficiency latency ... readiness_contributors_body_temperature readiness_contributors_hrv_balance readiness_contributors_previous_day_activity readiness_contributors_previous_night readiness_contributors_recovery_index readiness_contributors_resting_heart_rate readiness_contributors_sleep_balance readiness_score readiness_temperature_deviation readiness_temperature_trend_deviation
0 13.625 63.25 77.0 4440.0 2023-02-04T07:08:22.000-06:00 2023-02-03T22:40:22.000-06:00 2023-02-04 4650.0 85.0 990.0 ... 90.0 NaN 96.0 74.0 97.0 94.0 NaN 89.0 -0.38 NaN
1 15.125 92.49 19.0 6210.0 2023-02-05T09:37:28.000-06:00 2023-02-04T23:54:28.000-06:00 2023-02-05 4590.0 82.0 180.0 ... 100.0 NaN 82.0 79.0 100.0 59.0 NaN 78.0 -0.04 0.15
2 15.125 87.67 NaN 1920.0 2023-02-05T20:10:57.000-06:00 2023-02-05T19:36:57.000-06:00 2023-02-06 30.0 6.0 0.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 15.125 82.50 37.0 720.0 2023-02-05T20:36:02.000-06:00 2023-02-05T20:20:02.000-06:00 2023-02-06 0.0 25.0 0.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

4 rows × 47 columns

Index(['average_breath', 'average_heart_rate', 'average_hrv', 'awake_time',
       'bedtime_end', 'bedtime_start', 'day', 'deep_sleep_duration',
       'efficiency', 'latency', 'light_sleep_duration', 'lowest_heart_rate',
       'movement_30_sec', 'period', 'rem_sleep_duration', 'restless_periods',
       'score', 'segment_state', 'sleep_midpoint', 'time_in_bed',
       'total_sleep_duration', 'type', 'sleep_phase_5_min', 'timezone',
       'bedtime_start_delta', 'bedtime_end_delta', 'midpoint_at_delta',
       'heart_rate_5_min', 'hrv_5_min', 'contributors_total_sleep',
       'contributors_deep_sleep', 'contributors_rem_sleep',
       'contributors_efficiency', 'contributors_latency',
       'contributors_restfulness', 'contributors_timing',
       'readiness_contributors_activity_balance',
       'readiness_contributors_body_temperature',
       'readiness_contributors_hrv_balance',
       'readiness_contributors_previous_day_activity',
       'readiness_contributors_previous_night',
       'readiness_contributors_recovery_index',
       'readiness_contributors_resting_heart_rate',
       'readiness_contributors_sleep_balance', 'readiness_score',
       'readiness_temperature_deviation',
       'readiness_temperature_trend_deviation'],
      dtype='object')
In [7]:
print('\n\ndaily_sleep_score_df info:\n--------------------------\n',daily_sleep_score_df.shape)
display(daily_sleep_score_df.head(4))

daily_sleep_score_df info:
--------------------------
 (355, 2)
score day
0 81 2023-02-04
1 81 2023-02-05
2 89 2023-02-06
3 78 2023-02-07
In [8]:
# Confirm there are no duplicates of days in the biometric or sleep score data.
print(biometric_df['day'].duplicated().sum())
print(daily_sleep_score_df['day'].duplicated().sum())
0
0

Note: Three dataframes have been initialized separately because of how the data is organized.

For instance,

  • There is one entry per day in biometric_df which is primarily related to activity and other biometrics.
  • The sleep data in sleep_df potentially has multiple entries per day, indicating any detected sessions of sleep and their interruptions.
  • The third dataframe, daily_sleep_score_df, contains the sleep score, which will be used as the ground truth ($Y$).

Feature engineering and transformations will be performed on the two former dataframes before they are all combined for use in the models.
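Once sleep_df has been aggregated to one row per day (section 3.5.6), combining the three frames amounts to merging on the shared 'day' column. A minimal sketch with made-up values (not the real data):

```python
import pandas as pd

# Toy frames standing in for the three real dataframes (values are made up).
bio = pd.DataFrame({'day': ['2023-02-04', '2023-02-05'], 'steps': [10728, 8500]})
slp = pd.DataFrame({'day': ['2023-02-04', '2023-02-05'], 'total_sleep_duration': [27000, 25200]})
score = pd.DataFrame({'day': ['2023-02-04', '2023-02-05'], 'score': [81, 81]})

# Chained inner merges keep only the days present in all three frames.
combined = bio.merge(slp, on='day').merge(score, on='day')
print(list(combined.columns))  # ['day', 'steps', 'total_sleep_duration', 'score']
```

The inner join also conveniently drops days that lack either biometrics or a sleep score.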

Back to Table of Contents¶

3.2. Remove Oura's Score Aggregates: ¶

First, we must remove the features whose labels include "contributors". These columns contain standardized sub-scores that Oura's models aggregate and then average into the final score of each category.

In [9]:
# Remove contributor and score columns (multiple scores involved, will add truth label later).
biometric_df = biometric_df.loc[:, ~biometric_df.columns.str.contains('contributors|score')]
sleep_df = sleep_df.loc[:, ~sleep_df.columns.str.contains('contributors|score')]

print(biometric_df.shape)
print(sleep_df.shape)
(369, 26)
(730, 30)

We are left with 369 rows in biometric_df, which corresponds to 369 days, and 730 rows in sleep_df, which equates to that number of recorded sleep events. We've removed 33 feature columns containing the word 'contributors' or 'score', as we'll be using the score in daily_sleep_score_df instead.

Let's take a closer look at the remaining features, identifying those that will be of use in our models or not, and continue further data preparations.

In [10]:
print(biometric_df.columns, sleep_df.columns)
Index(['day', 'active_calories', 'average_met_minutes',
       'equivalent_walking_distance', 'high_activity_met_minutes',
       'high_activity_time', 'inactivity_alerts', 'low_activity_met_minutes',
       'low_activity_time', 'medium_activity_met_minutes',
       'medium_activity_time', 'meters_to_target', 'non_wear_time',
       'resting_time', 'sedentary_met_minutes', 'sedentary_time', 'steps',
       'target_calories', 'target_meters', 'total_calories', 'class_5_min',
       'met_1_min', 'ring_met_1_min', 'temperature_deviation',
       'temperature_trend_deviation', 'spo2_percentage'],
      dtype='object') Index(['average_breath', 'average_heart_rate', 'average_hrv', 'awake_time',
       'bedtime_end', 'bedtime_start', 'day', 'deep_sleep_duration',
       'efficiency', 'latency', 'light_sleep_duration', 'lowest_heart_rate',
       'movement_30_sec', 'period', 'rem_sleep_duration', 'restless_periods',
       'segment_state', 'sleep_midpoint', 'time_in_bed',
       'total_sleep_duration', 'type', 'sleep_phase_5_min', 'timezone',
       'bedtime_start_delta', 'bedtime_end_delta', 'midpoint_at_delta',
       'heart_rate_5_min', 'hrv_5_min', 'readiness_temperature_deviation',
       'readiness_temperature_trend_deviation'],
      dtype='object')
Back to Table of Contents¶

3.3. Removing Irrelevant Features: ¶

We'll be removing any features that don't seem relevant to our sleep score prediction model. For now we will not touch most of the potentially "overlapping" features that might introduce predictor collinearity issues, but will remove some obvious ones and address the remainder later.

In [11]:
# Columns to drop
# sleep_df
drop_col_sleep = ['efficiency', 'period', 'segment_state',
            'sleep_midpoint', 'sleep_phase_5_min', 'movement_30_sec',
            'timezone', 'bedtime_end_delta', 'midpoint_at_delta',
            'heart_rate_5_min', 'hrv_5_min'] # timezone? -- might need during EDA
sleep_df.drop(drop_col_sleep, axis = 1, inplace = True)

# biometric_df
drop_col_bio = ['average_met_minutes', 'equivalent_walking_distance', 
                'high_activity_met_minutes', 'inactivity_alerts', 
                'low_activity_met_minutes', 'medium_activity_met_minutes', 
                'sedentary_met_minutes', 'target_calories', 
                'target_meters', 'class_5_min', 
                'met_1_min', 'ring_met_1_min', 
                'temperature_trend_deviation']
biometric_df.drop(drop_col_bio, axis = 1, inplace = True)

Features like efficiency were removed for being redundant with other remaining features: efficiency is just the ratio of total_sleep_duration to time_in_bed, which would cause collinearity issues in our models later.
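To illustrate the redundancy with made-up numbers (sleep efficiency is roughly the percentage of time in bed spent asleep), efficiency adds no information beyond the two duration columns it is derived from:

```python
import numpy as np

# Hypothetical values: seconds asleep vs. seconds in bed for three nights.
total_sleep_duration = np.array([25200.0, 23400.0, 27000.0])
time_in_bed = np.array([28800.0, 27000.0, 28800.0])

# Efficiency is a deterministic function of the two columns above,
# so keeping all three would introduce a near-perfectly collinear predictor.
efficiency = total_sleep_duration / time_in_bed * 100
print(efficiency.round(1))  # [87.5 86.7 93.8]
```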

average_met_minutes, target_calories, target_meters, etc. are daily goals used by Oura to calculate other metrics.

Back to Table of Contents¶

3.4. Data Munging: ¶

We'll also take a look at the remaining features' data types and see if any data munging is necessary, i.e., formatting each feature appropriately as a float, int, date, or dummy/indicator encoding.

At the very least, we know there are a few date columns that would be beneficial to turn into pandas datetime objects.

In [12]:
# Reformat date columns
date_col_sleep = ['bedtime_end', 'bedtime_start', 'day']
# All times reformatted to UTC
sleep_df[date_col_sleep] = sleep_df[date_col_sleep].apply(pd.to_datetime, utc = True, errors = 'coerce')
biometric_df['day'] = biometric_df['day'].apply(pd.to_datetime, utc = True, errors = 'coerce')
daily_sleep_score_df['day'] = daily_sleep_score_df['day'].apply(pd.to_datetime, utc = True, errors = 'coerce')

Display all columns that aren't float or int.

In [13]:
print('biometric_df:')
print(biometric_df.select_dtypes(exclude = ['int64', 'float64']).dtypes)
print('\nsleep_df:')
print(sleep_df.select_dtypes(exclude = ['int64', 'float64']).dtypes)
biometric_df:
day    datetime64[ns, UTC]
dtype: object

sleep_df:
bedtime_end      datetime64[ns, UTC]
bedtime_start    datetime64[ns, UTC]
day              datetime64[ns, UTC]
type                          object
dtype: object

Everything above looks good.

We can see that ['type'] is an object, which likely means it's a string. Let's see what values it can take and whether it will be useful to keep.

In [14]:
sleep_df['type'].unique()
Out[14]:
array(['long_sleep', nan, 'late_nap', 'rest', 'sleep'], dtype=object)
Back to Table of Contents¶

3.5. Feature Engineering and Processing: ¶

3.5.1. Did You Take a Nap Today?: ¶

As mentioned before, sleep_df contains multiple entries per day, each of which is either a night's sleep, an interrupted period of sleep during the night, or an afternoon nap. It looks like Oura uses the ['type'] feature to indicate which kind of sleep each entry belongs to. We can use this information, along with the time data in the next section, to identify different periods of sleep and do a little feature engineering.

A reasonable solution is to total all the sleep that occurred per day, so that interrupted periods of sleep at night and any naps count toward that total. However, the nap information is lost once the sleep times are summed, and since a nap could plausibly increase the day's sleep score, it would be beneficial to flag days on which a nap occurred with a binary indicator (0 = No Nap, 1 = Yes Nap). We will create this new feature before consolidating the sleep entries.

We will identify naps as any entry whose ['type'] is not 'long_sleep' and that occurs between 9:00 AM and 6:00 PM local time (expressed as UTC in the code below).

In [15]:
# Initialize a time frame we can consider a nap/rest period (UTC format).
nap_upper = pd.to_datetime('23:59:00').time()
nap_lower = pd.to_datetime('15:00:00').time()
# Condition on the start, end time, and total sleep duration to filter out false-positives, remove any long_sleep types.
nap_bool = ((sleep_df['bedtime_start'].dt.time >= nap_lower) & 
            (sleep_df['bedtime_end'].dt.time <= nap_upper) & 
            (sleep_df['total_sleep_duration'] > 600) &
            (sleep_df['type'] != 'long_sleep'))
print('Naps identified:\n')
display(sleep_df[nap_bool])
# Insert a nap_today column with a binary yes/no for each day.
# Initialize the column with zeroes.
sleep_df['nap_today'] = 0
# Use the boolean mask to find nap days and iterate through them, setting nap_today to 1.
# TODO: This is not the pandas way but is fine with this amount of data. Use a better method later.
# Something like sleep_df['nap_today'].where(~nap_bool, 1, inplace = True) does not work in this situation:
# every entry for that day needs to be 1, not just the nap itself.
nap_days = sleep_df[nap_bool]['day']
for day in nap_days:
    sleep_df.loc[sleep_df['day'] == day, 'nap_today'] = 1
Naps identified:

average_breath average_heart_rate average_hrv awake_time bedtime_end bedtime_start day deep_sleep_duration latency light_sleep_duration lowest_heart_rate rem_sleep_duration restless_periods time_in_bed total_sleep_duration type bedtime_start_delta readiness_temperature_deviation readiness_temperature_trend_deviation
31 13.500 74.47 34.0 6270.0 2023-02-20 21:16:13+00:00 2023-02-20 19:05:13+00:00 2023-02-20 00:00:00+00:00 420.0 570.0 840.0 67.0 330.0 25.0 7860 1590.0 NaN 47113 NaN NaN
34 13.875 70.14 58.0 3540.0 2023-02-21 23:27:03+00:00 2023-02-21 21:59:03+00:00 2023-02-21 00:00:00+00:00 120.0 3600.0 1140.0 60.0 480.0 23.0 5280 1740.0 NaN 57543 NaN NaN
44 14.750 90.50 25.0 3690.0 2023-02-27 01:11:23+00:00 2023-02-26 23:54:23+00:00 2023-02-27 00:00:00+00:00 300.0 2760.0 540.0 90.0 90.0 24.0 4620 930.0 NaN 60863 NaN NaN
53 14.750 NaN NaN 1140.0 2023-03-01 23:27:00+00:00 2023-03-01 22:51:00+00:00 2023-03-01 00:00:00+00:00 570.0 750.0 420.0 NaN 30.0 11.0 2160 1020.0 NaN 60660 NaN NaN
63 14.625 71.67 NaN 930.0 2023-03-06 22:03:32+00:00 2023-03-06 21:30:32+00:00 2023-03-06 00:00:00+00:00 630.0 390.0 420.0 70.0 0.0 31.0 1980 1050.0 NaN 55832 NaN NaN
88 13.750 65.15 75.0 1530.0 2023-03-17 21:33:33+00:00 2023-03-17 20:17:33+00:00 2023-03-17 00:00:00+00:00 90.0 450.0 2940.0 62.0 0.0 88.0 4560 3030.0 NaN 55053 NaN NaN
89 13.750 68.33 63.0 2070.0 2023-03-17 23:52:33+00:00 2023-03-17 22:48:33+00:00 2023-03-18 00:00:00+00:00 540.0 1410.0 780.0 63.0 450.0 12.0 3840 1770.0 late_nap 64113 NaN NaN
116 13.625 70.40 70.0 3030.0 2023-03-30 20:27:01+00:00 2023-03-30 19:26:01+00:00 2023-03-30 00:00:00+00:00 180.0 690.0 450.0 68.0 0.0 10.0 3660 630.0 NaN 51961 NaN NaN
152 14.125 61.20 107.0 750.0 2023-04-23 21:10:49+00:00 2023-04-23 20:29:49+00:00 2023-04-23 00:00:00+00:00 1260.0 330.0 450.0 56.0 0.0 15.0 2460 1710.0 sleep 55789 NaN NaN
180 14.875 71.20 80.0 1080.0 2023-05-10 18:45:42+00:00 2023-05-10 17:44:42+00:00 2023-05-10 00:00:00+00:00 810.0 570.0 1770.0 67.0 0.0 29.0 3660 2580.0 sleep 45882 NaN NaN
197 13.500 66.75 117.0 1080.0 2023-05-16 23:12:57+00:00 2023-05-16 22:33:57+00:00 2023-05-17 00:00:00+00:00 60.0 1020.0 1200.0 63.0 0.0 19.0 2340 1260.0 late_nap 63237 NaN NaN
210 13.500 73.29 81.0 1680.0 2023-05-20 23:05:22+00:00 2023-05-20 22:17:22+00:00 2023-05-21 00:00:00+00:00 0.0 1770.0 1200.0 69.0 0.0 17.0 2880 1200.0 late_nap 62242 NaN NaN
233 14.500 70.20 64.0 1440.0 2023-06-03 23:11:00+00:00 2023-06-03 22:14:00+00:00 2023-06-04 00:00:00+00:00 1590.0 1320.0 390.0 64.0 0.0 13.0 3420 1980.0 NaN -20760 NaN NaN
251 14.250 59.73 123.0 1650.0 2023-06-11 21:39:31+00:00 2023-06-11 20:30:31+00:00 2023-06-11 00:00:00+00:00 900.0 600.0 1590.0 53.0 0.0 30.0 4140 2490.0 NaN 59431 NaN NaN
258 14.125 75.67 73.0 570.0 2023-06-17 21:25:32+00:00 2023-06-17 20:52:32+00:00 2023-06-17 00:00:00+00:00 150.0 150.0 1260.0 72.0 0.0 18.0 1980 1410.0 sleep 60752 NaN NaN
261 14.625 56.83 123.0 3090.0 2023-06-18 20:52:32+00:00 2023-06-18 19:36:32+00:00 2023-06-18 00:00:00+00:00 150.0 1770.0 1320.0 52.0 0.0 15.0 4560 1470.0 NaN 56192 NaN -0.13
264 15.375 NaN NaN 1980.0 2023-06-19 18:20:32+00:00 2023-06-19 17:33:02+00:00 2023-06-19 00:00:00+00:00 120.0 1920.0 750.0 NaN 0.0 29.0 2850 870.0 NaN 48782 NaN -0.14
298 16.875 NaN NaN 1141.0 2023-07-09 20:44:02+00:00 2023-07-09 20:07:01+00:00 2023-07-09 00:00:00+00:00 0.0 1140.0 1080.0 NaN 0.0 31.0 2221 1080.0 NaN 58021 NaN -0.03
322 13.375 NaN NaN 1050.0 2023-07-22 19:31:02+00:00 2023-07-22 19:02:02+00:00 2023-07-22 00:00:00+00:00 0.0 0.0 690.0 NaN 0.0 23.0 1740 690.0 NaN 54122 NaN 0.11
323 13.375 73.73 69.0 2070.0 2023-07-22 21:10:02+00:00 2023-07-22 19:36:02+00:00 2023-07-22 00:00:00+00:00 1170.0 1140.0 1800.0 70.0 600.0 72.0 5640 3570.0 NaN 56162 NaN 0.11
325 14.250 56.38 103.0 2070.0 2023-07-23 21:47:02+00:00 2023-07-23 20:21:32+00:00 2023-07-23 00:00:00+00:00 1200.0 1080.0 1860.0 53.0 0.0 19.0 5130 3060.0 NaN 58892 NaN 0.26
347 14.500 64.60 96.0 1260.0 2023-08-03 21:04:32+00:00 2023-08-03 20:16:02+00:00 2023-08-03 00:00:00+00:00 450.0 990.0 1200.0 62.0 0.0 3.0 2910 1650.0 NaN 58562 NaN -0.14
349 14.750 72.75 79.0 1260.0 2023-08-04 22:19:30+00:00 2023-08-04 21:28:30+00:00 2023-08-05 00:00:00+00:00 300.0 600.0 1500.0 66.0 0.0 32.0 3060 1800.0 late_nap 62910 NaN -0.02
353 14.125 NaN NaN 691.0 2023-08-06 20:23:34+00:00 2023-08-06 19:54:33+00:00 2023-08-06 00:00:00+00:00 0.0 390.0 1050.0 NaN 0.0 10.0 1741 1050.0 NaN 57273 NaN 0.10
377 16.250 NaN NaN 571.0 2023-08-20 22:38:00+00:00 2023-08-20 22:10:29+00:00 2023-08-20 00:00:00+00:00 90.0 420.0 960.0 NaN 30.0 0.0 1651 1080.0 NaN 61829 NaN 0.05
408 15.750 NaN NaN 1202.0 2023-09-04 23:57:04+00:00 2023-09-04 23:25:32+00:00 2023-09-05 00:00:00+00:00 60.0 0.0 630.0 NaN 0.0 23.0 1892 690.0 NaN -20068 NaN 0.16
452 17.875 89.67 21.0 1020.0 2023-09-25 16:37:57+00:00 2023-09-25 15:56:27+00:00 2023-09-26 00:00:00+00:00 210.0 690.0 1260.0 88.0 0.0 11.0 2490 1470.0 NaN 64587 NaN 0.06
455 15.375 79.80 NaN 1830.0 2023-09-27 16:11:57+00:00 2023-09-27 15:21:27+00:00 2023-09-28 00:00:00+00:00 30.0 0.0 1170.0 79.0 0.0 33.0 3030 1200.0 NaN 62487 NaN 0.18
463 16.375 NaN NaN 570.0 2023-09-30 21:22:29+00:00 2023-09-30 20:46:29+00:00 2023-10-01 00:00:00+00:00 210.0 300.0 1380.0 NaN 0.0 8.0 2160 1590.0 NaN -8011 NaN 0.14
508 13.375 61.17 101.0 420.0 2023-10-22 23:16:01+00:00 2023-10-22 22:30:01+00:00 2023-10-23 00:00:00+00:00 270.0 720.0 2010.0 58.0 60.0 22.0 2760 2340.0 late_nap 63001 NaN -0.28
522 13.875 62.00 92.0 2670.0 2023-10-27 21:00:04+00:00 2023-10-27 19:52:34+00:00 2023-10-27 00:00:00+00:00 60.0 2130.0 1230.0 59.0 90.0 16.0 4050 1380.0 NaN 53554 NaN 0.03
540 15.375 68.50 66.0 1380.0 2023-11-05 22:15:01+00:00 2023-11-05 21:33:01+00:00 2023-11-05 00:00:00+00:00 360.0 1380.0 780.0 66.0 0.0 31.0 2520 1140.0 sleep 55981 NaN -0.03
543 14.125 NaN NaN 841.0 2023-11-06 23:03:30+00:00 2023-11-06 22:33:59+00:00 2023-11-06 00:00:00+00:00 30.0 810.0 900.0 NaN 0.0 7.0 1771 930.0 NaN 59639 NaN NaN
590 15.500 90.60 NaN 1260.0 2023-11-30 20:52:33+00:00 2023-11-30 20:20:03+00:00 2023-11-30 00:00:00+00:00 0.0 0.0 690.0 86.0 0.0 23.0 1950 690.0 NaN 51603 NaN NaN
594 14.375 73.23 51.0 6150.0 2023-12-02 23:08:02+00:00 2023-12-02 20:21:02+00:00 2023-12-02 00:00:00+00:00 1230.0 4860.0 1800.0 66.0 840.0 0.0 10020 3870.0 NaN 51662 0.47 0.53
617 14.750 74.90 45.0 1830.0 2023-12-16 22:11:30+00:00 2023-12-16 21:04:30+00:00 2023-12-16 00:00:00+00:00 150.0 1260.0 2040.0 66.0 0.0 38.0 4020 2190.0 NaN 54270 NaN 0.10
654 14.500 57.82 123.0 1500.0 2024-01-02 23:22:30+00:00 2024-01-02 21:40:30+00:00 2024-01-02 00:00:00+00:00 720.0 990.0 2910.0 54.0 990.0 16.0 6120 4620.0 sleep 56430 NaN 0.27
722 14.625 63.80 82.0 2190.0 2024-02-03 21:12:02+00:00 2024-02-03 20:07:02+00:00 2024-02-03 00:00:00+00:00 60.0 1470.0 1650.0 56.0 0.0 27.0 3900 1710.0 sleep 50822 NaN -0.54
725 14.000 NaN NaN 3060.0 2024-02-04 22:54:30+00:00 2024-02-04 21:45:00+00:00 2024-02-04 00:00:00+00:00 0.0 2880.0 1110.0 NaN 0.0 0.0 4170 1110.0 NaN 56700 NaN -0.63
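As the TODO in the cell above notes, the per-day loop isn't idiomatic pandas. A vectorized equivalent can be sketched with Series.isin; the toy frame below stands in for the real sleep_df and nap_bool so nothing in the notebook is clobbered:

```python
import pandas as pd

# Toy stand-ins for the real sleep_df and nap_bool mask.
df = pd.DataFrame({'day': ['2023-02-20', '2023-02-20', '2023-02-21', '2023-02-22']})
mask = pd.Series([True, False, False, False])

# Every row sharing a day with a detected nap gets nap_today = 1,
# equivalent to the loop above but fully vectorized.
df['nap_today'] = df['day'].isin(df.loc[mask, 'day']).astype(int)
print(df['nap_today'].tolist())  # [1, 1, 0, 0]
```

Note that isin marks every entry on a nap day, not just the nap itself, which is exactly what the loop was compensating for.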
Back to Table of Contents¶

3.5.2. Aligning the Dataframes - Why workout today when there's always tomorrow?: ¶

Continuing to set up our feature dataframe: the biometric data collected on a given day affects the following night's sleep. Currently, the sleep score is recorded on the day you wake, meaning each date's score references the previous night's sleep. This is important to note because the biometric features we're left with (e.g., calories burned, steps, heart rate in biometric_df) are collected on a day whose sleep score has already been recorded.

Logically, this data, which mostly reflects activity levels, is more relevant to the upcoming night's sleep. Since we're organizing metrics per day, we need to shift the biometric dates forward by 1 day to align them with our model's target/response variable (sleep score). This will align the dataframes for when we merge them on the ['day'] column later.

Example:

Original

| Score | Day (sleep_df) | Bedtime Start | Bedtime End | // | Day (biometric_df) | Steps | Active Calories |
|---|---|---|---|---|---|---|---|
| ... | ... | ... | ... | // | ... | ... | ... |
| 81 | 01/20/2024 | 1/19/2024 22:30 | 1/20/2024 07:00 | // | 01/20/2024 | 13000 | 800 |
| 65 | 01/21/2024 | 1/20/2024 23:45 | 1/21/2024 05:20 | // | 01/21/2024 | 21000 | 1900 |
| ... | ... | ... | ... | // | ... | ... | ... |

New DF

Shifted biometric data to next day, aligning with relevant sleep score.

| Score | Day (sleep_df) | Bedtime Start | Bedtime End | // | Day (biometric_df) | Steps | Active Calories |
|---|---|---|---|---|---|---|---|
| ... | ... | ... | ... | // | ... ⇩ | ... ⇩ | ... ⇩ |
| 81 | 01/20/2024 | 1/19/2024 22:30 | 1/20/2024 07:00 | // | ... ⇩ | ... ⇩ | ... ⇩ |
| 65 | 01/21/2024 | 1/20/2024 23:45 | 1/21/2024 05:20 | // | 01/20/2024 +1 ⇩ | 13000 ⇩ | 800 ⇩ |
| ... | ... | ... | ... | // | 01/21/2024 +1 ⇩ | 21000 ⇩ | 1900 ⇩ |

Thanks to https://tools.timodenk.com/markdown-table-to-html for markdown table to html conversion.

In [16]:
# Shift day one day into the future to align with relevant sleep score day.
biometric_df['day'] = biometric_df['day'] + pd.DateOffset(days = 1)
# Remove last day in biometric dataset since it won't have a sleep score with day shifting.
bio_max_idx = biometric_df.loc[biometric_df['day'] == biometric_df['day'].max()].index
biometric_df.drop(index = bio_max_idx, axis = 0, inplace = True)

# Verify each dataframe starts and ends on the same day.
print('Earliest Dates in each DF:')
print(biometric_df['day'].min(), '<-- biometric_df')
print(sleep_df['day'].min(), '<-- sleep_df')
print(daily_sleep_score_df['day'].min(), '<-- daily_sleep_score_df')
print('\nLatest Dates in each DF:')
print(biometric_df['day'].max(), '<-- biometric_df')
print(sleep_df['day'].max(), '<-- sleep_df')
print(daily_sleep_score_df['day'].max(), '<-- daily_sleep_score_df')
Earliest Dates in each DF:
2023-02-04 00:00:00+00:00 <-- biometric_df
2023-02-04 00:00:00+00:00 <-- sleep_df
2023-02-04 00:00:00+00:00 <-- daily_sleep_score_df

Latest Dates in each DF:
2024-02-06 00:00:00+00:00 <-- biometric_df
2024-02-06 00:00:00+00:00 <-- sleep_df
2024-02-06 00:00:00+00:00 <-- daily_sleep_score_df
Back to Table of Contents¶

3.5.3. Removing Days Without Sleep - No rest for the predicted: ¶

Now we should identify any days for which a sleep score wasn't recorded or wasn't accurate, say because not enough sleep was recorded, or because the ring wasn't worn during the night but still recorded biometric data for the day. We'll again use the long_sleep value in ['type'] to identify days where sleep was detected, and use it as a mask to find days where no sleep was recorded.

In [17]:
# Identify days where there wasn't a long period of sleep
sleep_df_long = sleep_df.loc[sleep_df['type'] == 'long_sleep']

# Use merge() to find the differences in dataframes indicating days that didn't record a long sleep.
# Requires a method where the two df indexes don't match.
# Credit to: https://stackoverflow.com/questions/48647534/find-difference-between-two-data-frames
no_sleep = pd.DataFrame(sleep_df['day']).merge(sleep_df_long['day'], indicator = True, how='left').loc[lambda x : x['_merge']!='both']
print('Entries in which no long periods of sleep were recorded that day:')
display(no_sleep)

# We can remove these days with no long_sleep from the main dataframes as they won't have sleep scores anyways.
    # Sleep DF
sleep_df.drop(no_sleep.index, axis = 0, inplace = True)

    # Biometric DF
    # We can use our previous work and just remove the days that the two DFs don't share.
no_sleep_bio_days =  pd.DataFrame(biometric_df['day']).merge(sleep_df['day'], indicator = True, how='left').loc[lambda x : x['_merge']!='both']['day']
no_sleep_bio_index = biometric_df[biometric_df['day'].isin(no_sleep_bio_days)].index
biometric_df.drop(no_sleep_bio_index, axis = 0, inplace = True)

    # Daily Sleep Score DF
no_sleep_score_index = pd.DataFrame(daily_sleep_score_df['day']).merge(biometric_df['day'], indicator = True, how='left').loc[lambda x : x['_merge']!='both']['day'].index
daily_sleep_score_df.drop(index = no_sleep_score_index, axis = 0, inplace = True)
Entries in which no long periods of sleep were recorded that day:
day _merge
103 2023-03-24 00:00:00+00:00 left_only
140 2023-04-15 00:00:00+00:00 left_only
161 2023-05-01 00:00:00+00:00 left_only
162 2023-05-01 00:00:00+00:00 left_only
163 2023-05-02 00:00:00+00:00 left_only
173 2023-05-08 00:00:00+00:00 left_only
223 2023-05-28 00:00:00+00:00 left_only
224 2023-05-28 00:00:00+00:00 left_only
326 2023-07-24 00:00:00+00:00 left_only
402 2023-08-31 00:00:00+00:00 left_only
441 2023-09-21 00:00:00+00:00 left_only
442 2023-09-21 00:00:00+00:00 left_only
536 2023-11-05 00:00:00+00:00 left_only
537 2023-11-05 00:00:00+00:00 left_only
538 2023-11-05 00:00:00+00:00 left_only
539 2023-11-05 00:00:00+00:00 left_only
540 2023-11-05 00:00:00+00:00 left_only
588 2023-11-29 00:00:00+00:00 left_only
Back to Table of Contents¶

3.5.4. Binning a Target Variable?!: ¶

As we've seen, our target variable (response), the sleep score, comes to us as discrete integers. This value is produced by Oura's own models, and the exact process behind it is obfuscated from us. There's an argument to be made about whether to treat this as a regression task or a classification task, and whether there's ever merit in casting a regression task into the classification space, since transforming a target variable almost certainly introduces bias and causes information loss; we will discuss that further in the conclusion section.

For now, since this project is just an exercise and our dataset has relatively few samples, it might be beneficial to explore transforming the sleep score into categorical bins. We will keep the discrete sleep score column for use in our regression models, but create a few more columns, using different techniques, that indicate the bin each score falls into for use in logistic regression/classification models.

Enough theory, let's discuss how we can accomplish binning any variable. We have a few different methods we can use to approach how to categorize sleep scores:

  • The simplest would be to create (n) bins and evenly space them.
    • Pros:
      • Easy to implement.
      • Handles data drift in the case of the addition of new data.
    • Cons:
      • Can create rather arbitrary boundaries and thus arbitrary class interpretation.
      • Likely to create heavily imbalanced classes.
      • Outliers are not addressed with this method.
  • Look at the distribution of the scores and use the quantiles or standard deviation to help find the bin intervals. (Sometimes called median-centered/quantile binning.)
    • Pros:
      • Distributes classes along the data distribution.
      • Limits the effect of outliers, since bins widen as they get further from the median.
      • Improved interpretability of bin edges and classes.
      • Pandas has qcut() and cut() to help with implementation.
    • Cons:
      • Still need to decide how many bins to create which leads to potentially arbitrary boundaries.

Here we will implement a few versions of these binning techniques and analyze which makes the most sense for our sleep score data.
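As a quick illustration of the trade-off between the two approaches, here is a toy comparison using pandas' cut() (even-width) and qcut() (quantile) binning. The synthetic scores are a hypothetical stand-in for daily_sleep_score_df['score']; the sections that follow apply these ideas to the real data.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the real sleep scores: roughly normal around 77.
rng = np.random.default_rng(0)
scores = pd.Series(np.clip(rng.normal(77, 8, 300), 36, 94))

# Even-width: pd.cut splits the value range into equal-width intervals.
even = pd.cut(scores, bins=5)
# Quantile: pd.qcut splits so each bin holds ~1/5 of the observations.
quant = pd.qcut(scores, q=5)

print(even.value_counts().sort_index())   # heavily imbalanced counts
print(quant.value_counts().sort_index())  # near-equal counts
```

Note how the even-width bins in the tails capture far fewer samples than the central ones, which is exactly the class-imbalance problem described above.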

3.5.4.1. Even-Width Bins: ¶

Below, an even-width binning method is implemented with 10 bins, each of width 10. An offset of 0.5 is used so edge values fall unambiguously into a bin, since the sleep scores all hold integer values. The justification for these parameters is fairly arbitrary, but it seems reasonable that there is a meaningful difference between a sleep score of, say, 79 and one of 88.

In [18]:
# Instead of having 0-100 potential label predictions when using a classification model,
# we should create a categorical 'binned' label column.
# Create bins. We can use .5 to easily decide left edge or right edge issues.
bin_number = 10
bin_interval_even = np.linspace(0.5, 100.5, bin_number + 1, dtype = float)
bin_labels = [f"{i} to {i + bin_number}" for i in bin_interval_even if i != bin_interval_even[-1]]
# np.digitize() can map the sleep scores to the correct bins
bin_idx = np.digitize(daily_sleep_score_df['score'], bin_interval_even, right = False)
# Create new column with the new bin labels
bin_map_even = [bin_labels[value - 1] for value in bin_idx]
daily_sleep_score_df['score_bin_even'] = bin_map_even
print(bin_labels)
display(daily_sleep_score_df['score_bin_even'].value_counts())

daily_sleep_score_df['score'].plot(kind='kde')
plt.xlim(0, 100)
plt.vlines(bin_interval_even, ymin = 0, ymax = 0.055, color = 'purple', alpha = 0.5)
for idx, label in enumerate(bin_labels):
    plt.text(x = ((bin_interval_even[idx] + bin_interval_even[idx + 1]) / 2) - 1.5,
             y = (plt.ylim()[0] + plt.ylim()[1]) / 2,
             s = label,
             rotation = 'vertical',
             alpha = 0.5)
plt.xlabel('Sleep Score')
['0.5 to 10.5', '10.5 to 20.5', '20.5 to 30.5', '30.5 to 40.5', '40.5 to 50.5', '50.5 to 60.5', '60.5 to 70.5', '70.5 to 80.5', '80.5 to 90.5', '90.5 to 100.5']
score_bin_even
70.5 to 80.5     165
80.5 to 90.5     101
60.5 to 70.5      68
50.5 to 60.5       9
40.5 to 50.5       5
90.5 to 100.5      4
30.5 to 40.5       1
Name: count, dtype: int64
Out[18]:
Text(0.5, 0, 'Sleep Score')

You'll notice that, once overlaid on the distribution of the sleep score data, many bins are wasted and capture no data at all. Logically you might increase the bin width, reducing the total number of bins, but you quickly run into a different problem: with wider bins, the majority of the data lands in only a few classes. Arguably that's already a problem with the parameters chosen here, where most of the data falls into a handful of bins, causing severe class imbalance. Decreasing the bin width, on the other hand, makes the class boundaries harder to interpret and increases the effect of outliers.

Back to Table of Contents¶
3.5.4.2. Quantile/Median-Centered Bins: ¶

The distribution of scores is split into 5 quantiles (quintiles), meaning each bin holds roughly 1/5 of the observations. The bins are labeled and distributed across 5 categories: ['Worst', 'Bad', 'Avg', 'Good', 'Best'].

Note: this and the following method denote an 'Avg' bin, but in reality it is centered on the median; the two are close enough here that they effectively mean the same thing.

In [19]:
#labels_quantile = ['Worst', 'Bad', 'Below Avg','Above Avg', 'Good', 'Best']
labels_quantile = ['Worst', 'Bad', 'Avg', 'Good', 'Best']
bin_map_quantile, bin_interval_quantile = pd.qcut(daily_sleep_score_df['score'], 5, labels = labels_quantile, retbins = True)
daily_sleep_score_df['score_bin_quantile'] = bin_map_quantile

print('Quantile Bins:', bin_interval_quantile)
print('\nDistribution of Bins:', bin_map_quantile.value_counts())
daily_sleep_score_df['score'].plot(kind = 'kde')
plt.xlim(0, 100)
plt.vlines(bin_interval_quantile, ymin = 0, ymax = 0.055, color = 'purple', alpha = 0.5)
for idx, label in enumerate(labels_quantile):
    plt.text(x = ((bin_interval_quantile[idx] + bin_interval_quantile[idx + 1]) / 2) - 1.5,
             y = (plt.ylim()[0] + plt.ylim()[1]) / 2,
             s = f'{label} - {bin_map_quantile.value_counts()[label]}',
             rotation = 'vertical',
             alpha = 0.5)
plt.xlabel('Sleep Score')
Quantile Bins: [36. 69. 75. 78. 82. 94.]

Distribution of Bins: score
Bad      84
Worst    73
Best     68
Good     66
Avg      62
Name: count, dtype: int64
Out[19]:
Text(0.5, 0, 'Sleep Score')

With very little parameter tweaking, you can see that the bin intervals overlaid on the distribution fit the data much better than the even-width method, producing bins with reasonable balance that match the data distribution. Another positive attribute of this method is that the first and last interval edges sit at the min and max values of the data. You can also see how this method limits the effect of outliers on our classifications: the bins naturally widen as they get further from the median, since there are fewer observations at those score values.

Back to Table of Contents¶
3.5.4.3. Custom Quantile Bins: ¶

The creation of this method isn't too dissimilar from the previous quantile bin implementation, but it uses only the median and standard deviation to create each interval. Since this data is my own, the justifications are based on my own perception of my sleep quality (i.e., I believe my sleep generally skews toward poor sleep, so the bins and the number of samples in each should reflect that perception, which is backed up here by the distribution's left tail). This method also uses 6 bins, effectively splitting on the median, with the labels: ['Worst', 'Bad', 'Below Avg', 'Above Avg', 'Good', 'Best']

Note: No hard-coding was used to get these intervals, only the median and standard deviation, so they will adjust when the data is updated, potentially shifting if my sleep consistently gets better.

In [20]:
labels_custom = ['Worst', 'Bad', 'Below Avg', 'Above Avg', 'Good', 'Best']
median_score = daily_sleep_score_df['score'].median()
std_score = daily_sleep_score_df['score'].std()
bin_interval_custom = [0,
            median_score - (1.5 * std_score),
            median_score - (0.5 * std_score),
            median_score,
            median_score + (0.5 * std_score),
            median_score + (1.5 * std_score),
            100]
bin_map_custom = pd.cut(daily_sleep_score_df['score'], bins = bin_interval_custom, labels = labels_custom)
daily_sleep_score_df['score_bin_custom'] = bin_map_custom

print('Custom Quantile Bins:', bin_interval_custom)
print('\nDistribution of Bins:', bin_map_custom.value_counts())
daily_sleep_score_df['score'].plot(kind='kde')
plt.xlim(0, 100)
plt.vlines(bin_interval_custom, ymin = 0, ymax = 0.055, color = 'purple', alpha = 0.5)
for idx, label in enumerate(labels_custom):
    plt.text(x = ((bin_interval_custom[idx] + bin_interval_custom[idx + 1]) / 2) - 1.5,
             y = (plt.ylim()[0] + plt.ylim()[1]) / 2,
             s = f'{label} - {bin_map_custom.value_counts()[label]}',
             rotation = 'vertical',
             alpha = 0.5)
plt.xlabel('Sleep Score')
Custom Quantile Bins: [0, 64.39549639460978, 72.79849879820325, 77.0, 81.20150120179675, 89.60450360539022, 100]

Distribution of Bins: score
Below Avg    88
Good         78
Bad          77
Above Avg    75
Worst        30
Best          5
Name: count, dtype: int64
Out[20]:
Text(0.5, 0, 'Sleep Score')

This distribution of bins looks to match my own domain knowledge of my sleep quality; we'll see later how it performs.

Here we can take a look at a slice of the data with our new categorical versions of the target variable (y).

In [21]:
display(daily_sleep_score_df.tail(5))
score day score_bin_even score_bin_quantile score_bin_custom
350 68 2024-02-02 00:00:00+00:00 60.5 to 70.5 Worst Bad
351 76 2024-02-03 00:00:00+00:00 70.5 to 80.5 Avg Below Avg
352 75 2024-02-04 00:00:00+00:00 70.5 to 80.5 Bad Below Avg
353 83 2024-02-05 00:00:00+00:00 80.5 to 90.5 Best Good
354 83 2024-02-06 00:00:00+00:00 80.5 to 90.5 Best Good
Back to Table of Contents¶

3.5.5. Aggregating the Sleep Metrics: ¶

Finally, now that we've extracted the features we wanted from sleep_df, we can total the multiple sleep entries per day and then combine the sleep metrics with the biometric data.

Note: some metrics are not summed in this process; of those, only the metrics tied to the 'long_sleep' entry will be used in the final dataset. (e.g., we don't really care what the average heart rate variability (['average_hrv']) or latency (['latency']) is for an interrupted sleep section during the night that only lasted 30 minutes.)

In [22]:
# Sum each day's sleep metrics (important for days with interrupted sleep or naps).
# Overwrites each day's entry.
sleep_df = sleep_df[['day', 
         'deep_sleep_duration', 
         'light_sleep_duration', 
         'rem_sleep_duration', 
         'restless_periods', 
         'awake_time', 
         'time_in_bed', 
         'total_sleep_duration']].groupby(['day']).sum().reset_index()

# Features extracted only from the long sleep to include in the final dataframe.
sleep_df_long = sleep_df_long[['day', 
                               'average_breath', 
                               'average_heart_rate', 
                               'average_hrv', 
                               'latency', 
                               'lowest_heart_rate', 
                               'bedtime_start_delta',
                               'nap_today']]
# Merge the summed features with the long_sleep-only features.
sleep_df = sleep_df.merge(sleep_df_long, on = 'day', how = 'outer')
sleep_df.head(2)
Out[22]:
day deep_sleep_duration light_sleep_duration rem_sleep_duration restless_periods awake_time time_in_bed total_sleep_duration average_breath average_heart_rate average_hrv latency lowest_heart_rate bedtime_start_delta nap_today
0 2023-02-04 00:00:00+00:00 4650.0 15570.0 5820.0 282.0 4440.0 30480 26040.0 13.625 63.25 77.0 990.0 56.0 -4778 0
1 2023-02-05 00:00:00+00:00 4590.0 14970.0 9210.0 240.0 6210.0 34980 28770.0 15.125 92.49 19.0 180.0 75.0 -332 0
Back to Table of Contents¶

3.5.6. Joining the Dataframes: ¶

Now that all feature cleaning, creation, and transformations are complete, we should verify that each dataframe has the same number of rows before the upcoming merge.

In [23]:
print('Verify all DF.shape matches:', sleep_df.shape[0] == biometric_df.shape[0] == daily_sleep_score_df.shape[0])
print(sleep_df.shape, '<-- sleep_df')
print(biometric_df.shape, '<-- biometric_df')
print(daily_sleep_score_df.shape, '<-- daily_sleep_score_df')
Verify all DF.shape matches: True
(353, 15) <-- sleep_df
(353, 13) <-- biometric_df
(353, 5) <-- daily_sleep_score_df

Now, let's create one dataframe for both the scores and features, then display them.

In [24]:
# Merge sleep and biometric DFs.
bio_sleep_df = sleep_df.merge(biometric_df, on = 'day', how = 'outer')
# Merge sleep score df and feature df.
bio_sleep_df = daily_sleep_score_df.merge(bio_sleep_df, on = 'day', how = 'outer')
bio_sleep_df.reset_index(drop = True, inplace = True)

display(bio_sleep_df)
print(bio_sleep_df.columns)
score day score_bin_even score_bin_quantile score_bin_custom deep_sleep_duration light_sleep_duration rem_sleep_duration restless_periods awake_time ... low_activity_time medium_activity_time meters_to_target non_wear_time resting_time sedentary_time steps total_calories temperature_deviation spo2_percentage
0 81 2023-02-04 00:00:00+00:00 80.5 to 90.5 Good Above Avg 4650.0 15570.0 5820.0 282.0 4440.0 ... 8460 120 6300 36000 19680 22140 2718 2429 NaN NaN
1 81 2023-02-05 00:00:00+00:00 80.5 to 90.5 Good Above Avg 4590.0 14970.0 9210.0 240.0 6210.0 ... 23220 2940 0 0 25740 34380 11638 3075 -0.38 98.523
2 89 2023-02-06 00:00:00+00:00 80.5 to 90.5 Best Good 8580.0 15210.0 7350.0 302.0 6240.0 ... 15000 2580 2700 0 35280 33480 9001 2814 -0.04 97.181
3 78 2023-02-07 00:00:00+00:00 70.5 to 80.5 Avg Above Avg 4500.0 17700.0 6030.0 284.0 5010.0 ... 13440 2160 5300 0 31080 39600 7544 2748 -0.22 99.177
4 80 2023-02-08 00:00:00+00:00 70.5 to 80.5 Good Above Avg 4920.0 15630.0 6420.0 245.0 6210.0 ... 16560 3180 2300 4800 31020 30840 8668 2852 -0.25 98.634
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
348 68 2024-02-02 00:00:00+00:00 60.5 to 70.5 Worst Bad 4890.0 13800.0 4890.0 197.0 9753.0 ... 18000 5400 -4400 60 26700 36120 13714 3151 -0.22 96.706
349 76 2024-02-03 00:00:00+00:00 70.5 to 80.5 Avg Below Avg 5220.0 17790.0 4770.0 215.0 9180.0 ... 20820 3660 -3700 1020 36420 24360 13388 3092 -0.71 99.200
350 75 2024-02-04 00:00:00+00:00 70.5 to 80.5 Bad Below Avg 4980.0 13770.0 4950.0 203.0 9386.0 ... 15840 4260 -2800 3600 30780 31860 13186 3016 -0.60 98.131
351 83 2024-02-05 00:00:00+00:00 80.5 to 90.5 Best Good 7140.0 12990.0 7320.0 187.0 6501.0 ... 13260 2520 400 360 31680 38400 8380 2773 -0.48 98.595
352 83 2024-02-06 00:00:00+00:00 80.5 to 90.5 Best Good 5700.0 17700.0 5310.0 200.0 5767.0 ... 17820 2820 -300 1980 33480 30300 9970 2881 -0.14 98.291

353 rows × 31 columns

Index(['score', 'day', 'score_bin_even', 'score_bin_quantile',
       'score_bin_custom', 'deep_sleep_duration', 'light_sleep_duration',
       'rem_sleep_duration', 'restless_periods', 'awake_time', 'time_in_bed',
       'total_sleep_duration', 'average_breath', 'average_heart_rate',
       'average_hrv', 'latency', 'lowest_heart_rate', 'bedtime_start_delta',
       'nap_today', 'active_calories', 'high_activity_time',
       'low_activity_time', 'medium_activity_time', 'meters_to_target',
       'non_wear_time', 'resting_time', 'sedentary_time', 'steps',
       'total_calories', 'temperature_deviation', 'spo2_percentage'],
      dtype='object')
Back to Table of Contents¶

3.5.7. Dealing with Missing Data - NaNs: ¶

Lastly, we need to see what data is missing from our final dataframe and decide how to rectify any found.

In [25]:
# Explore and deal with NaN values.
print('Columns with NaNs present:', bio_sleep_df.columns[bio_sleep_df.isna().any()].to_list())
print(f"{'':<5}Total number of NaNs: {bio_sleep_df.isna().sum(axis = 1).sum()}")
Columns with NaNs present: ['temperature_deviation', 'spo2_percentage']
     Total number of NaNs: 48

48 NaN values were found, all in the temperature_deviation and spo2_percentage columns. Without assuming the importance of these features, removing 48 rows of data just for these missing values seems like a bad choice given the limited amount of data. Since the range of values in these features doesn't vary much, each NaN will be replaced with the median of its feature, so as not to alter the feature's distribution.

In [26]:
# A good option to deal with each feature individually:
# >>> values = {"Feature 1": df["Feature 1"].median(), "Feature 2": 0, "Feature 3": df["Feature 3"].mean()}
# >>> df.fillna(value = values)
# The only features with NaNs currently are temperature_deviation and spo2_percentage; using the median for now.
# TODO: Come back and tune each feature.
for column in bio_sleep_df:
    if bio_sleep_df[column].dtype == 'float64' or bio_sleep_df[column].dtype == 'int64':
        # Assign back rather than calling fillna(inplace = True) on a column slice,
        # which raises chained-assignment warnings in newer pandas versions.
        bio_sleep_df[column] = bio_sleep_df[column].fillna(bio_sleep_df[column].median())
print(f'Number of NaN values remaining after adjustments: {bio_sleep_df.isna().sum(axis = 1).sum()}')
Number of NaN values remaining after adjustments: 0
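As a sketch of the per-feature imputation the TODO above suggests, df.fillna() accepts a column-to-value dictionary, so each feature can get its own strategy. The frame below is a toy stand-in; the column names mirror bio_sleep_df, but the values are invented:

```python
import numpy as np
import pandas as pd

# Toy stand-in with the two columns that actually contained NaNs.
df = pd.DataFrame({
    'temperature_deviation': [0.1, np.nan, -0.2, np.nan],
    'spo2_percentage': [98.0, 97.5, np.nan, 98.4],
})

# One strategy per column: median here, but a mean or a constant works too.
fill_values = {
    'temperature_deviation': df['temperature_deviation'].median(),
    'spo2_percentage': df['spo2_percentage'].median(),
}
df = df.fillna(value=fill_values)
print(df.isna().sum().sum())  # 0
```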
Back to Table of Contents¶

4. Exploratory Data Analysis (EDA) ¶


4.1. High-Level Summary: ¶

First let's take a look at the high-level summary.

In [27]:
bio_sleep_df.describe()
Out[27]:
score deep_sleep_duration light_sleep_duration rem_sleep_duration restless_periods awake_time time_in_bed total_sleep_duration average_breath average_heart_rate ... low_activity_time medium_activity_time meters_to_target non_wear_time resting_time sedentary_time steps total_calories temperature_deviation spo2_percentage
count 353.000000 353.000000 353.000000 353.000000 353.000000 353.000000 353.000000 353.000000 353.000000 353.000000 ... 353.000000 353.000000 353.000000 353.000000 353.000000 353.000000 353.000000 353.000000 353.000000 353.000000
mean 75.597734 5265.042493 15382.776204 5475.042493 244.045326 6445.603399 32571.694051 26122.861190 13.884561 65.803088 ... 17263.682720 3128.838527 -1493.484419 2969.745042 31003.342776 31844.362606 10819.963173 2940.813031 0.007450 97.842986
std 8.403002 1667.128745 2500.618719 1456.503659 54.461654 2899.808704 4112.171365 3212.363992 0.481592 7.836005 ... 4813.519199 1855.161430 4482.521299 6154.118003 4777.300153 6612.383545 3447.932402 247.672000 0.216197 0.851728
min 36.000000 210.000000 9150.000000 1110.000000 88.000000 1590.000000 19740.000000 14040.000000 12.750000 49.400000 ... 900.000000 120.000000 -38700.000000 0.000000 8280.000000 5040.000000 1601.000000 2229.000000 -0.710000 92.072000
25% 71.000000 4320.000000 13710.000000 4590.000000 211.000000 4555.000000 30221.000000 24060.000000 13.625000 60.240000 ... 13980.000000 2160.000000 -3700.000000 0.000000 28800.000000 27360.000000 8522.000000 2802.000000 -0.110000 97.472000
50% 77.000000 5130.000000 15300.000000 5490.000000 239.000000 5940.000000 32460.000000 26280.000000 13.875000 64.170000 ... 16860.000000 2820.000000 -800.000000 360.000000 31380.000000 32640.000000 10388.000000 2917.000000 0.020000 98.027000
75% 81.000000 6120.000000 17040.000000 6390.000000 280.000000 7560.000000 35345.000000 28080.000000 14.125000 69.730000 ... 19920.000000 3720.000000 1100.000000 3480.000000 33900.000000 36660.000000 12708.000000 3060.000000 0.120000 98.419000
max 94.000000 10200.000000 21690.000000 9420.000000 492.000000 27456.000000 51036.000000 35520.000000 15.875000 96.680000 ... 36240.000000 24660.000000 8900.000000 51840.000000 49260.000000 50160.000000 23382.000000 5036.000000 1.420000 99.200000

8 rows × 27 columns

  • score seems to be slightly left-skewed, with the mean of 75.60 below the median of 77.00.
    • The distribution also seems to be highly concentrated around the median, giving it a high kurtosis value.
  • deep_sleep_duration and rem_sleep_duration have similar data distributions.
  • light_sleep_duration looks to be the most common type of sleep when compared to the others.
  • steps averages ~10,819 per day, which is something to be happy about.
Pair Plots:¶

27 features are too many for a full pairwise plot, so a few notable ones will be selected.

The binned sleep scores developed earlier are a convenient way to visualize potential clusters in the data with respect to sleep score. First, we can plot features that are directly related to sleep metrics.

In [28]:
pair_columns_sleep = ['score', 'score_bin_custom', 'deep_sleep_duration', 
                'light_sleep_duration', 'rem_sleep_duration', 
                'latency', 'bedtime_start_delta']
sns.pairplot(bio_sleep_df[pair_columns_sleep], hue = 'score_bin_custom')
Out[28]:
<seaborn.axisgrid.PairGrid at 0x177bebc10>

Grouping and coloring by score really assists in quickly identifying relationships.

For instance:

  • deep_sleep_duration vs score shows the intuitive relationship between the amount of deep sleep and sleep score. Blue, or "Worst" sleep, is concentrated on the lower left of the data, indicating the least amount of deep sleep, while purple, or "Best" sleep, is mostly on the upper right.
  • Also we can see some correlations between the types of sleep.
  • The variability of latency decreases as sleep score goes up.
In [29]:
pair_columns_sleep2 = ['score', 'score_bin_custom', 'restless_periods', 
                'awake_time', 'average_breath', 'average_hrv',
                 'nap_today']
sns.pairplot(bio_sleep_df[pair_columns_sleep2], hue = 'score_bin_custom')
Out[29]:
<seaborn.axisgrid.PairGrid at 0x2896682d0>
  • nap_today shows an interesting distribution where if a nap was taken it was likely either a bad sleep score day or good sleep score day.
  • A slight correlation can be seen in restless_periods vs score.

Now let's plot a few features that are related to daily activity levels to see how they relate to sleep score and each other.

In [30]:
pair_columns_active = ['score', 'score_bin_custom',
                'high_activity_time','low_activity_time', 'medium_activity_time',
                'resting_time', 'sedentary_time']
sns.pairplot(bio_sleep_df[pair_columns_active], hue = 'score_bin_custom')
Out[30]:
<seaborn.axisgrid.PairGrid at 0x28d9dbed0>
Back to Table of Contents¶

4.2. EDA Helper Functions: ¶

The majority of the time-based data is in seconds, which is hard to interpret. Creating a seconds-to-hours converter will be helpful for EDA.

In [31]:
# Function to transform inputs from seconds to hours.
def sec_to_hours(seconds):
    return seconds / 60 / 60

def reg_scatter_plot(feat_list):
    """Prints to output (n x 2) subplot array of scatter plots where x = feature in inputted list vs y = the target variable "score". 
    Can handle any number of input models.

    Parameters:
        features (list): Iterable of feature names from bio_sleep_df in the form of a string.
    
    Returns:
        None. Prints plot to output."""
    
    fig, ax = plt.subplots(int(len(feat_list)/2), 2, figsize = (6, 6), sharey = True)
    for i, feat in enumerate(feat_list):

        # Create regression line
        m, b = np.polyfit(bio_sleep_df[feat], bio_sleep_df['score'], 1)

        # i iterates through rows and columns with // and % operators.
        ax[i//2 , i%2].scatter(bio_sleep_df[feat], bio_sleep_df['score'], alpha = 0.5, s = 15)
        ax[i//2 , i%2].plot(bio_sleep_df[feat], m*bio_sleep_df[feat] + b, color = 'purple')
        ax[i//2 , i%2].set_title(feat)

    fig.suptitle('Sleep Score vs Types of Sleep')
    fig.supxlabel('Duration (seconds)')
    fig.supylabel('Sleep Score')
    fig.tight_layout()
    
    return None
Back to Table of Contents¶

4.3. Visualizing the Sleep Metrics: ¶

Let's see how the summary of the direct sleep metrics look when converted to hours.

In [32]:
print('Types of Sleep in Hours:')
sec_to_hours(bio_sleep_df[['deep_sleep_duration', 
                           'light_sleep_duration', 
                           'rem_sleep_duration', 
                           'total_sleep_duration']].describe())
Types of Sleep in Hours:
Out[32]:
deep_sleep_duration light_sleep_duration rem_sleep_duration total_sleep_duration
count 0.098056 0.098056 0.098056 0.098056
mean 1.462512 4.272993 1.520845 7.256350
std 0.463091 0.694616 0.404584 0.892323
min 0.058333 2.541667 0.308333 3.900000
25% 1.200000 3.808333 1.275000 6.683333
50% 1.425000 4.250000 1.525000 7.300000
75% 1.700000 4.733333 1.775000 7.800000
max 2.833333 6.025000 2.616667 9.866667

Much easier to interpret now. (Note the count row was also divided by 3,600, since the conversion was applied to the entire summary table.)

  • Deep and REM sleep average about 1.5 hours per night.
  • Light sleep accounts for about 4.3 hours.
  • Total sleep averages 7.25 hours per night over the last year.

Going further, we can visualize the total_sleep_duration distribution.

In [33]:
x_min = min(sec_to_hours(bio_sleep_df['total_sleep_duration']))
x_max = max(sec_to_hours(bio_sleep_df['total_sleep_duration']))
sec_to_hours(bio_sleep_df['total_sleep_duration']).plot(kind = 'kde')
plt.title('Distribution of Total Amount of Sleep per Day')
plt.xlabel('Hours of Sleep')
plt.xticks(np.arange(round(x_min - 1), round(x_max + 1), 0.5))
plt.xlim(round(x_min - 1), round(x_max + 1))
plt.grid(alpha = 0.4)

It might be interesting to see if there are any changes in sleep throughout the year.

In [34]:
ax = sns.boxplot(x = bio_sleep_df['day'].dt.month, y = bio_sleep_df['score'])
ax.set(title = 'Sleep Score by Month Boxplot', xlabel = 'month')
plt.show()

The majority of the sleep score outliers appear during the winter months, though we can also see the variance increase in August and September.

Let's try to plot another feature in the same way that probably affects sleep score.

In [35]:
ax = sns.boxplot(x = bio_sleep_df['day'].dt.month, y = bio_sleep_df['awake_time'])
ax.set(title = 'Awake Time by Month Boxplot', xlabel = 'month')
plt.show()

Recall, awake_time tracks the amount of time in bed but not asleep.

Interestingly, a negative correlation between awake_time and sleep score is easy to spot: when awake time increases, for example in May (5), sleep score decreases during the same month.
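The relationship eyeballed in the boxplots could also be quantified with a Pearson correlation via Series.corr(). A minimal sketch on synthetic stand-in data (the real check would use bio_sleep_df's awake_time and score columns):

```python
import numpy as np
import pandas as pd

# Synthetic stand-in: score falls as awake_time rises, plus noise.
rng = np.random.default_rng(1)
awake = rng.uniform(2000, 10000, 200)
score = 90 - awake / 400 + rng.normal(0, 3, 200)
df = pd.DataFrame({'awake_time': awake, 'score': score})

# Series.corr() defaults to the Pearson correlation coefficient.
r = df['awake_time'].corr(df['score'])
print(round(r, 2))  # strongly negative
```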

The features with the biggest effect on sleep score are likely those that measure sleep duration. Plotting sleep score against the sleep durations will help determine by how much.

In [36]:
sleep_features = ['total_sleep_duration', 'deep_sleep_duration', 'rem_sleep_duration', 'light_sleep_duration']

reg_scatter_plot(sleep_features)

R.E.M. and deep sleep seem to have the biggest positive correlation with sleep score. Light sleep does as well, though much less obviously than the others. Total sleep has the biggest correlation of all, but since it's just a linear combination of the others, that makes sense (more on this later).

Back to Table of Contents¶

4.4. Remove Day Column: ¶

We're done exploring the data in relation to time, so the ['day'] column is no longer necessary; it will be dropped from the dataframe from here on out.

In [37]:
# Remove day column in preparation for model training.
bio_sleep_df.drop(['day'], axis = 1, inplace = True)
Back to Table of Contents¶

4.5. Correlation Matrix: ¶

It's time to look at each feature in relation to the others and determine if there will be any correlation/collinearity issues using this feature set in the model fitting. This will aid in the process of removing features. It will also give us the first quantitative, though limited, look at how the independent variables (predictor features) are correlated with the dependent (target) variable.

In [38]:
# Correlation
# Include score column but drop categorical bins to see correlation between response and predictors too.
corr_matrix = bio_sleep_df.drop(['score_bin_even', 'score_bin_quantile', 'score_bin_custom'], axis = 1).corr()
# Plot matrix.
fig, ax = plt.subplots(figsize = (16, 16))
sns.heatmap(corr_matrix, ax = ax, annot = True, fmt = '.1f')
Out[38]:
<Axes: >

With the correlation matrix we can identify potentially important features if they have high correlation with sleep score.

  • All 3 types of sleep durations (deep, light, rem) seem to be correlated with sleep score.
  • total_sleep_duration
  • restless_periods
  • awake_time
  • latency
  • bedtime_start_delta

Moving past what is correlated with score, we also need to be aware of any correlation between features, which introduces collinearity between predictors into our models. In most cases, this reduces the effectiveness of a model.

  • total_sleep_duration is logically highly correlated with the other, more specific sleep durations and other features directly related to sleep. This is because total sleep is a linear combination of the others (adding deep, light, and REM sleep together equals total sleep). It seems better practice to keep the 3 specific types of sleep, as they contain more specific information.
  • time_in_bed also has this issue for the same reasons.
  • lowest_heart_rate, average_hrv, and a few others also have high correlation between them.

Some or many of these features will have to be removed, but in most cases we only want to remove one feature out of the pairs that show correlation.

Back to Table of Contents¶

4.6. Analyzing Bedtime Delta - The moving target of sleep: ¶

Before removing features, let's look at one that showed a relatively small correlation with sleep score. From the correlation matrix we can see that bedtime_start_delta has a correlation of -0.3 with score. Let's analyze it further to determine whether it is worth keeping for our model.

Recall that bedtime_start_delta is calculated as the difference between the start of sleep and the 'regular' bedtime, where the baseline is an average over the past two weeks since the user's bedtime may shift over time.

In [39]:
bio_sleep_df[['bedtime_start_delta']].describe()
Out[39]:
bedtime_start_delta
count 353.000000
mean -1218.929178
std 3302.654300
min -14398.000000
25% -3177.000000
50% -1109.000000
75% 629.000000
max 13351.000000
In [40]:
plt.scatter(bio_sleep_df['bedtime_start_delta'], bio_sleep_df['score'], alpha = 0.5)
plt.xlabel('Bedtime Start Delta (seconds)')
plt.ylabel('Sleep Score')
Out[40]:
Text(0, 0.5, 'Sleep Score')

Again, bedtime_start_delta has a correlation score of -0.3 in relation to sleep score. Looking at this graph it is a little difficult to see this correlation, but there does seem to be a slight negative trend.

Also a negative correlation makes logical sense here as:

  • If bedtime_start_delta = 0 that indicates a bedtime that aligns with the 'regular' baseline bedtime.
  • As bedtime_start_delta increases in the positive direction it indicates sleep was initiated past the 'regular' bedtime. The graph shows a negative trend as this happens.
  • As bedtime_start_delta decreases in the negative direction it indicates sleep was initiated before the 'regular' bedtime. The graph shows a very slight positive trend as this happens, but is harder to see than the previous change.

The intuition is that going to bed early would increase quality of sleep and thus sleep score, and conversely the opposite for a late bedtime.

Back to Table of Contents¶

4.6.1. Hypothesis Testing: ¶

Given this slightly vague correlation, it would be beneficial to test the effects of bedtime_start_delta on sleep score using a hypothesis test.

First, let's define what constitutes an early or late bedtime (corresponding to a negative or positive bedtime_start_delta):

  • Early Bedtime = bedtime_start_delta < 0

  • Late Bedtime = bedtime_start_delta > 0

We will then group the sleep scores by the two bedtimes (early or late) and perform a hypothesis test to determine if the samples of sleep scores have different population means.

Let's begin by checking to see what the sample mean of these two sleep score distributions are.

In [41]:
late_bedtime_scores = bio_sleep_df[bio_sleep_df['bedtime_start_delta'] > 0]['score']
early_bedtime_scores = bio_sleep_df[bio_sleep_df['bedtime_start_delta'] < 0]['score']

# Sample mean checks.
print(f"The sample mean of sleep scores considered to have late bedtimes: {late_bedtime_scores.mean()}")
print(f"\nThe sample mean of sleep scores considered to have early bedtimes: {early_bedtime_scores.mean()}\n")
The sample mean of sleep scores considered to have late bedtimes: 72.30630630630631

The sample mean of sleep scores considered to have early bedtimes: 77.10743801652893

Back to Table of Contents¶
4.6.1.1 Two-sample t-test Hypothesis: ¶

Null Hypothesis - $H_0: \mu_{score | early bedtime} = \mu_{score | late bedtime}$

Alternate Hypothesis - $H_1: \mu_{score | early bedtime} \not= \mu_{score | late bedtime}$

All tests will be done with a significance level ($\alpha$): 0.05

Note: Before we can perform this two-sample t-test, we need to confirm some assumptions the test makes about the data.

  • Independence: The data values are independent; one data point does not affect any other. (This is dubious, as a case could be made that one night's sleep depends on the previous night's, but we'll make the assumption it doesn't here.)
  • Normality: The data from each group is normally distributed.
  • Equal Variance: The two independent groups have equal variance.
In [42]:
# Set level of significance.
alpha = 0.05
Back to Table of Contents¶
4.6.1.2 Checking for Normality: ¶
In [43]:
fig, ax = plt.subplots(2, 2, figsize = (6, 6))
sns.histplot(late_bedtime_scores, kde = True, ax = ax[0, 0])
sm.qqplot(late_bedtime_scores, fit = 'scale', line = '45', alpha = 0.5 , ax = ax[0, 1])
ax[0,0].set_title('Late Bedtime Score: Density')
ax[0,1].set_title('Late Bedtime Score: QQ Plot')
sns.histplot(early_bedtime_scores, kde = True, ax = ax[1, 0])
sm.qqplot(early_bedtime_scores, fit = 'scale', line = '45', alpha = 0.5 , ax = ax[1, 1])
ax[1,0].set_title('Early Bedtime Score: Density')
ax[1,1].set_title('Early Bedtime Score: QQ Plot')

fig.tight_layout()
plt.show()

Looking at both the distributions and QQ plots, the data appears almost, but not quite, normally distributed. Both distributions are left-skewed (negatively skewed), and the QQ plot tails curve off the 45-degree line.

We can confirm this quantitatively with a Shapiro-Wilk test.

Null Hypothesis - $H_0: X \sim Normal(\mu, \sigma^2)$, the samples do come from a normal distribution.

Alternate Hypothesis - $H_1: X \nsim Normal(\mu, \sigma^2)$, the samples DO NOT come from a normal distribution.

In [44]:
# TODO: If data is updated the following is not setup if sample conditions change drastically.
# Let's test if our samples display normality with the shapiro-wilk test.
# Null hypothesis H_0: samples come from a normal distribution.
print('Early Bedtime:')
t_stat_sw, p_value_sw = stats.shapiro(early_bedtime_scores)
if p_value_sw > alpha:
    print(f"Shapiro-Wilk test (Normality): \n{'':<5}with a p-value of {p_value_sw:.4e}, \n{'':<5}we fail to reject the null hypothesis: samples are from a normally distributed population.")
    print("\nSamples DO come from a normal distribution.")
else:
    print(f"Shapiro-Wilk test (Normality): \n{'':<5}with a p-value of {p_value_sw:.4e}, \n{'':<5}we successfully reject the null hypothesis: samples are from a normally distributed population.")
    print("\nSamples do NOT come from a normal distribution.")

print('\nLate Bedtime:')
t_stat_sw, p_value_sw = stats.shapiro(late_bedtime_scores)
if p_value_sw > alpha:
    print(f"Shapiro-Wilk test (Normality): \n{'':<5}with a p-value of {p_value_sw:.4e}, \n{'':<5}we fail to reject the null hypothesis: samples are from a normally distributed population.")
    print("\nSamples DO come from a normal distribution.")
else:
    print(f"Shapiro-Wilk test (Normality): \n{'':<5}with a p-value of {p_value_sw:.4e}, \n{'':<5}we successfully reject the null hypothesis: samples are from a normally distributed population.")
    print("\nSamples do NOT come from a normal distribution.")
Early Bedtime:
Shapiro-Wilk test (Normality): 
     with a p-value of 1.0500e-08, 
     we successfully reject the null hypothesis: samples are from a normally distributed population.

Samples do NOT come from a normal distribution.

Late Bedtime:
Shapiro-Wilk test (Normality): 
     with a p-value of 2.2177e-06, 
     we successfully reject the null hypothesis: samples are from a normally distributed population.

Samples do NOT come from a normal distribution.

For both groups we rejected the null hypothesis that the data comes from a normally distributed population.

To move on we can attempt to transform the data to fix this normality issue.

For a slight negative skew, a common transformation is:

$\sqrt{max(x+1) - x}$

Then for a larger skew, log base 10:

$\log_{10}{(max(x+1) - x)}$

Finally for even larger skews, inverse:

$\frac{1}{max(x+1) - x}$

Note: for a right skew (positive skew), just remove the max() reflection, following the pattern $\sqrt{x}$.
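As a quick illustration (toy numbers, not this notebook's data), the reflect-then-compress idea behind these transformations can be sketched as:

```python
import numpy as np

# Toy left-skewed sample (illustrative values only).
x = np.array([50.0, 70.0, 80.0, 85.0, 88.0, 90.0])

# Reflect so the long left tail points right, then compress it.
reflected = np.max(x + 1) - x        # max(x + 1) - x, as in the formulas above
sqrt_t = np.sqrt(reflected)          # mild skew
log_t = np.log10(reflected)          # moderate skew
inv_t = 1.0 / reflected              # severe skew

# The reflection reverses order: the highest raw score maps to the
# smallest transformed value.
print(reflected.min(), sqrt_t.round(2))
```

Keep in mind that the reflection flips the ordering, so the sign of any correlation computed on the transformed scores also flips.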

In [45]:
# Transformations
# Square root first
late_bedtime_scores = np.sqrt(max(late_bedtime_scores + 1) - late_bedtime_scores)
early_bedtime_scores = np.sqrt(max(early_bedtime_scores + 1) - early_bedtime_scores)

Now to perform the previous checks to see if the transformations made a difference.

In [46]:
fig, ax = plt.subplots(2, 2, figsize = (6, 6))
sns.histplot(late_bedtime_scores, kde = True, ax = ax[0, 0])
sm.qqplot(late_bedtime_scores, fit = 'scale', line = '45', alpha = 0.5 , ax = ax[0, 1])
ax[0,0].set_title('Late Bedtime Score: Density')
ax[0,1].set_title('Late Bedtime Score: QQ Plot')
sns.histplot(early_bedtime_scores, kde = True, ax = ax[1, 0])
sm.qqplot(early_bedtime_scores, fit = 'scale', line = '45', alpha = 0.5 , ax = ax[1, 1])
ax[1,0].set_title('Early Bedtime Score: Density')
ax[1,1].set_title('Early Bedtime Score: QQ Plot')

fig.tight_layout()
plt.show()

# Null hypothesis H_0: samples come from a normal distribution.
print('Early Bedtime:')
t_stat_sw, p_value_sw = stats.shapiro(early_bedtime_scores)
if p_value_sw > alpha:
    print(f"Shapiro-Wilk test (Normality): \n{'':<5}with a p-value of {p_value_sw}, \n{'':<5}we fail to reject the null hypothesis: samples are from a normally distributed population.")
    print("\nSamples DO come from a normal distribution.")
else:
    print(f"Shapiro-Wilk test (Normality): \n{'':<5}with a p-value of {p_value_sw}, \n{'':<5}we successfully reject the null hypothesis: samples are from a normally distributed population.")
    print("\nSamples do NOT come from a normal distribution.")

print('\nLate Bedtime:')
t_stat_sw, p_value_sw = stats.shapiro(late_bedtime_scores)
if p_value_sw > alpha:
    print(f"Shapiro-Wilk test (Normality): \n{'':<5}with a p-value of {p_value_sw}, \n{'':<5}we fail to reject the null hypothesis: samples are from a normally distributed population.")
    print("\nSamples DO come from a normal distribution.")
else:
    print(f"Shapiro-Wilk test (Normality): \n{'':<5}with a p-value of {p_value_sw}, \n{'':<5}we successfully reject the null hypothesis: samples are from a normally distributed population.")
    print("\nSamples do NOT come from a normal distribution.")
Early Bedtime:
Shapiro-Wilk test (Normality): 
     with a p-value of 0.051339421421289444, 
     we fail to reject the null hypothesis: samples are from a normally distributed population.

Samples DO come from a normal distribution.

Late Bedtime:
Shapiro-Wilk test (Normality): 
     with a p-value of 0.06582079827785492, 
     we fail to reject the null hypothesis: samples are from a normally distributed population.

Samples DO come from a normal distribution.

Great, the transformations did the trick, but just barely. Judging by these and the previous plots it looks like there are a few outliers in the tails of the data that are also causing havoc. We'll leave them in for now, but will address them before building our models.

Back to Table of Contents¶
4.6.1.3 Checking for Variance Equivalence: ¶

Since we've now shown our data is consistent with a normal distribution, we can use an F-test to determine if our two groups have equal variances. (If normality had not been confirmed, the Levene test could be used instead, though the t-test's own normality assumption would also have been violated.)

F-test for Equality of Variances (Two-Tailed Version)

Null Hypothesis - $H_0: \sigma_{score | early bedtime}^{2} = \sigma_{score | late bedtime}^{2}$

Alternate Hypothesis - $H_1: \sigma_{score | early bedtime}^{2} \not= \sigma_{score | late bedtime}^{2}$

In [47]:
# Calculate variance of each group.
s1 = np.var(late_bedtime_scores, ddof = 1) # ddof = 1 to calculate sample var.
s2 = np.var(early_bedtime_scores, ddof = 1)

# Null hypothesis H_0: Var(late) == Var(early)
# Larger var needs to be on top.
if s2 > s1:
    f_stat = s2 / s1
    df_num = len(early_bedtime_scores) - 1 # numerator
    df_denom = len(late_bedtime_scores) - 1 # denominator
else:
    f_stat = s1 / s2
    df_num = len(late_bedtime_scores) - 1 # numerator
    df_denom = len(early_bedtime_scores) - 1 # denominator

# Two-tailed p-value: double the upper-tail probability (f_stat >= 1 by construction).
p_value_f = 2 * (1 - stats.f.cdf(f_stat, df_num, df_denom))
if p_value_f > alpha:
    print(f"F-test (variance): \n{'':<5}with a p-value of {p_value_f}, \n{'':<5}we fail to reject the null hypothesis: Var(late) == Var(early)")
    print("\nPopulations HAVE equal variances.")
else:
    print(f"F-test (variance): \n{'':<5}with a p-value of {p_value_f}, \n{'':<5}we successfully reject the null hypothesis: Var(late) == Var(early)")
    print("\nPopulations DO NOT have equal variances.")
F-test (variance): 
     with a p-value of 0.8975117230927374, 
     we fail to reject the null hypothesis: Var(late) == Var(early)

Populations HAVE equal variances.

Great, this shows we can move forward with the two-sample t-test to see whether these two groups have equal means.

Back to Table of Contents¶
4.6.1.4 Performing a Two-sample t-test: ¶

Reminder,

Null Hypothesis - $H_0: \mu_{score | early bedtime} = \mu_{score | late bedtime}$

Alternate Hypothesis - $H_1: \mu_{score | early bedtime} \not= \mu_{score | late bedtime}$

In [48]:
# Finally let's perform a t-test to see if the mean scores are equal or not depending on the sleep time.
# Null hypothesis H_0: mean(late) == mean(early)
t_stat_t, p_value_t = stats.ttest_ind(late_bedtime_scores, early_bedtime_scores, alternative = 'two-sided')
if p_value_t > alpha:
    print(f"Two-Sample t-test (mean): \n{'':<5}with a p-value of {p_value_t}, \n{'':<5}we fail to reject the null hypothesis: mean(late) == mean(early)")
    print(f"\nThe means of the two samples are equal!")
else:
    print(f"Two-Sample t-test (mean): \n{'':<5}with a p-value of {p_value_t}, \n{'':<5}we successfully reject the null hypothesis: mean(late) == mean(early)")
    print(f"\nThe two sleep score population means are NOT equal!")
Two-Sample t-test (mean): 
     with a p-value of 0.013629025835902385, 
     we successfully reject the null hypothesis: mean(late) == mean(early)

The two sleep score population means are NOT equal!

This hypothesis test shows that the population mean sleep scores of the two bedtime_start_delta sets (late and early bedtimes) are different.

$\mu_{score | early bedtime} \not= \mu_{score | late bedtime}$

We can extrapolate this to say bedtime_start_delta does have some effect on sleep score and is worth including in the models.
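As a supplementary check, an effect size such as Cohen's d quantifies how large the mean difference is relative to the pooled spread, complementing the t-test's reject/fail-to-reject answer. A minimal sketch with synthetic stand-in scores (not the notebook's data; group sizes and means are illustrative only):

```python
import numpy as np

def cohens_d(a, b):
    # Pooled-variance effect size for two independent samples.
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * np.var(a, ddof=1) + (nb - 1) * np.var(b, ddof=1)) / (na + nb - 2)
    return (np.mean(a) - np.mean(b)) / np.sqrt(pooled_var)

# Synthetic groups roughly mimicking the sample means seen earlier.
rng = np.random.default_rng(1)
early = rng.normal(77, 8, 240)
late = rng.normal(72, 8, 110)

print(round(cohens_d(early, late), 2))
```

A d around 0.5 is conventionally read as a "medium" effect, which would line up with a difference that is statistically detectable but visually subtle in a scatter plot.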

Back to Table of Contents¶

4.7. OLS Multiple Linear Regression as a Diagnostic Tool - Pre-Feature Removal: ¶

Before removing any features let's fit a linear regression model to ['score'] as the target variable (response). This should eventually help us identify important features and give a good idea about the severity of multicollinearity between predictors.

In [49]:
linear_clf = sm.OLS(bio_sleep_df.iloc[:, :1], sm.add_constant(bio_sleep_df.iloc[:, 4:])).fit()
linear_clf.summary()
Out[49]:
OLS Regression Results
Dep. Variable: score R-squared: 0.885
Model: OLS Adj. R-squared: 0.877
Method: Least Squares F-statistic: 101.0
Date: Fri, 01 Mar 2024 Prob (F-statistic): 1.30e-137
Time: 13:15:23 Log-Likelihood: -869.51
No. Observations: 353 AIC: 1791.
Df Residuals: 327 BIC: 1892.
Df Model: 25
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
const 4.3622 28.532 0.153 0.879 -51.768 60.493
deep_sleep_duration -0.0005 0.001 -0.550 0.583 -0.002 0.001
light_sleep_duration -0.0014 0.001 -1.458 0.146 -0.003 0.000
rem_sleep_duration -5.281e-05 0.001 -0.056 0.955 -0.002 0.002
restless_periods -0.0221 0.004 -5.402 0.000 -0.030 -0.014
awake_time -0.0053 0.004 -1.426 0.155 -0.013 0.002
time_in_bed 0.0047 0.004 1.253 0.211 -0.003 0.012
total_sleep_duration -0.0019 0.003 -0.692 0.489 -0.007 0.004
average_breath 0.3119 0.478 0.653 0.514 -0.628 1.252
average_heart_rate -0.0980 0.071 -1.389 0.166 -0.237 0.041
average_hrv -0.0070 0.014 -0.487 0.626 -0.035 0.021
latency -0.0021 0.000 -12.681 0.000 -0.002 -0.002
lowest_heart_rate 0.0248 0.060 0.413 0.680 -0.093 0.143
bedtime_start_delta -0.0006 6.11e-05 -9.205 0.000 -0.001 -0.000
nap_today -0.6323 0.557 -1.134 0.257 -1.729 0.464
active_calories -0.0025 0.010 -0.257 0.797 -0.022 0.017
high_activity_time 0.0005 0.001 0.329 0.742 -0.002 0.003
low_activity_time -0.0001 0.000 -0.642 0.521 -0.001 0.000
medium_activity_time -0.0002 0.000 -0.490 0.624 -0.001 0.001
meters_to_target -4.253e-05 0.000 -0.416 0.678 -0.000 0.000
non_wear_time 7.165e-05 0.000 0.565 0.572 -0.000 0.000
resting_time 0.0001 0.000 1.042 0.298 -0.000 0.000
sedentary_time 6.74e-05 0.000 0.490 0.624 -0.000 0.000
steps 2.323e-05 0.000 0.195 0.845 -0.000 0.000
total_calories 0.0065 0.009 0.719 0.473 -0.011 0.024
temperature_deviation 0.3037 0.828 0.367 0.714 -1.325 1.932
spo2_percentage 0.1547 0.210 0.736 0.462 -0.259 0.568
Omnibus: 18.471 Durbin-Watson: 1.919
Prob(Omnibus): 0.000 Jarque-Bera (JB): 23.825
Skew: -0.438 Prob(JB): 6.71e-06
Kurtosis: 3.924 Cond. No. 1.00e+16


Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The smallest eigenvalue is 1.59e-20. This might indicate that there are
strong multicollinearity problems or that the design matrix is singular.

Unsurprisingly, this shows a deeply flawed model when using all available features. There is a severe collinearity/multicollinearity issue among the predictors currently in this model, indicated by the condition number of 1.00e+16. Also, while the adjusted R² and F-statistic p-value suggest a well-fitted model, the individual t-test p-values show almost no features as significant; this is likely due to the collinearity and multicollinearity issues just mentioned.

We will remove redundant and insignificant features from our dataset and refit this model after to see how our changes affect the model's performance.

A quick sanity check for whether this data is compatible with linear regression methods is to analyze the residuals. Here we plot the residuals against the model's fitted values to check for non-linear relationships or heteroskedasticity (non-equal variance) issues.

In [50]:
plt.scatter(linear_clf.fittedvalues, linear_clf.resid, alpha = 0.5)
plt.xlabel('Fitted Values')
plt.ylabel('Residuals')
plt.title('MLR Residuals vs Fitted Values')
plt.hlines(y = 0, xmin = 40, xmax = 100, colors = 'black', linestyles='--')
plt.show()

This looks reasonable and checks the boxes for linear regression. While there are certainly outliers to deal with, there is no curvature or funnelling of the points that would indicate the issues mentioned above.

Back to Table of Contents¶

5. Feature Selection/Removal ¶


Now that we've done the work to understand the data, we can use what we've learned above to make decisions on which features to keep or remove.

In [51]:
# Remove sign on correlation score for comparison.
# TODO: This method arbitrarily selects the first feature (of the two with a high score)
#   that it comes across. Change this method to have an insight-based selection method.
#   Possibly a preference list.
corr_matrix_abs = corr_matrix.abs()
# Drop score from x and y axis.
corr_matrix_abs.drop('score', axis = 0, inplace = True)
corr_matrix_abs.drop('score', axis = 1, inplace = True)
# Mask off only one of the correlation triangles to avoid duplicates.
corr_triangle = corr_matrix_abs.where(np.triu(np.ones(corr_matrix_abs.shape), k = 1).astype(bool))
# Create list of features with correlation higher than a threshold.
# Threshold set to > 0.7
to_drop = [column for column in corr_triangle.columns if any(corr_triangle[column] > 0.7)]
print(to_drop)
['total_sleep_duration', 'average_hrv', 'lowest_heart_rate', 'medium_activity_time', 'meters_to_target', 'steps', 'total_calories']
In [52]:
# For now let's manually remove features (see TODO note above).
choices_to_drop = [('average_hrv', 'average_heart_rate'),
           ('lowest_heart_rate', 'average_hrv'),
           ('medium_activity_time', 'active_calories'),
           ('meters_to_target', 'active_calories'),
           ('steps', 'a lot'),
           ('total_calories', 'a lot')]

bio_sleep_df.drop(['lowest_heart_rate', 
                   'active_calories', 
                   'meters_to_target', 
                   'steps', 
                   'total_calories', 
                   'average_heart_rate', 
                   'time_in_bed', 
                   'total_sleep_duration'],
                  axis = 1, inplace = True)

Re-plot the correlation matrix to see changes.

In [53]:
# Correlation
# Include score column to see correlation between response and predictors too.
corr_matrix = bio_sleep_df.drop(['score_bin_even', 'score_bin_quantile', 'score_bin_custom'], axis = 1).corr()
# Plot matrix.
fig, ax = plt.subplots(figsize = (16, 16))
sns.heatmap(corr_matrix, ax = ax, annot = True, fmt = '.1f')
Out[53]:
<Axes: >

After feature removal, this correlation matrix looks much more appropriate for model fitting. Most of the highly inter-correlated features have been selectively removed, while as many features as possible that correlate strongly with score were kept. It also suggests there are likely more features to remove that will not contribute to modeling sleep score.

A next step would be subset feature selection / backwards feature selection starting from all features.

In [54]:
# TODO: Implement automated feature selection.
Back to Table of Contents¶

6. Outlier Removal ¶

We've seen a few outliers during the EDA that could potentially cause issues in our models, especially in classification models where class examples are sparse.

The class bins created earlier can be used to quickly get an idea of where these outliers are, at least in the target variable.

In [55]:
# Identify any score_bin_even where that class n == 1
display(bio_sleep_df['score_bin_even'].value_counts())
one_class = (bio_sleep_df['score_bin_even'].value_counts() == 1)
one_class_drop = (one_class.loc[one_class == True].index)
score_bin_even
70.5 to 80.5     165
80.5 to 90.5     101
60.5 to 70.5      68
50.5 to 60.5       9
40.5 to 50.5       5
90.5 to 100.5      4
30.5 to 40.5       1
Name: count, dtype: int64

Remove any sample whose class has a count of 1. This is also necessary if any class balancing is performed during the train/test split.

In [56]:
# Remove outlier for now so we can class-balance the splits (min per class required = 2)
for value in one_class_drop:
    class_outlier_loc = bio_sleep_df.loc[bio_sleep_df['score_bin_even'] == value]
    bio_sleep_df.drop(class_outlier_loc.index, axis = 0, inplace = True)
    
bio_sleep_df.reset_index(drop = True, inplace = True)
Back to Table of Contents¶

7. Train Test Split ¶


Since we'll be using sklearn's StandardScaler() to standardize, we need to split first so that the scaler is fit on the training data only, keeping test-set information out of the standardization and preventing data leakage.

In total we will have 3 types of train / test splits to use on our models, but they will be split identically. We can also use the stratify parameter to class balance the training and test sets which will help generalize models training on this fairly sparse dataset.

Note: It would also be better practice to split the data before EDA to prevent data leakage, but this was not necessary for the scope of this project.
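A toy illustration (synthetic labels, not this dataset) of what the stratify parameter does — each split keeps the class proportions of the full label set:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Imbalanced synthetic labels: 80 / 15 / 5 across three classes.
y = np.array([0] * 80 + [1] * 15 + [2] * 5)
X = np.arange(len(y)).reshape(-1, 1)

_, _, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                    random_state=0, stratify=y)

# The 20-sample test set keeps the 80/15/5 ratio: roughly 16 / 3 / 1.
print(np.bincount(y_te))
```

Without stratify, a small rare class (like the 5-sample one here) can easily end up entirely in one split, which is exactly the failure mode we want to avoid with sparse score bins.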

In [57]:
rand_state = 87654321 #123456789 #1111 # Used in all models that accept a random state.
test_ratio = 0.2 
bin_type = 'score_bin_custom'

# Discretized Scores
# Stratify-balance the classes with the bin column (note this will slightly alter regression scores too).
X_train, X_test, y_train, y_test = train_test_split(bio_sleep_df.iloc[:, 4:], bio_sleep_df.iloc[:, :4], 
                                                    test_size = test_ratio, 
                                                    random_state = rand_state)#, 
                                                    #stratify = bio_sleep_df['score_bin_custom'])

# Binned/Categorical Class Labels
y_train_bin, y_test_bin = y_train[bin_type], y_test[bin_type]
Back to Table of Contents¶

7.1. Standardize Variables - Created separately on the same split: ¶

Since we'll be using many machine learning models in which regularization techniques can be sensitive to differences in predictor variance magnitudes/scale (e.g., Lasso/Ridge Regression, SVM, etc.) and expect features to be roughly normally distributed, we will be standardizing features appropriately.

Standardization will be done with the following formula:

$$Z = \frac{(X - \bar{X})}{S}$$

Centers inputs around 0 with the sample mean, and scales them with the sample standard deviation.
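A small sanity-check sketch (synthetic data) showing that StandardScaler matches the manual formula. One caveat worth knowing: sklearn uses the biased standard deviation (ddof = 0) rather than the sample S, a difference that is negligible at this sample size:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(loc=10.0, scale=3.0, size=(100, 2))

Z = StandardScaler().fit_transform(X)

# Manual equivalent: center with the column means, scale with the ddof = 0 std.
Z_manual = (X - X.mean(axis=0)) / X.std(axis=0, ddof=0)

print(np.allclose(Z, Z_manual))
```

Each standardized column then has mean 0 and standard deviation 1, which is what regularized models and distance-based methods expect.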

In [58]:
# Centered and Standardized
scaler = StandardScaler()
X_train_scaled = pd.DataFrame(scaler.fit_transform(X_train), columns = scaler.get_feature_names_out())
X_test_scaled = pd.DataFrame(scaler.transform(X_test), columns = scaler.get_feature_names_out())
# Squeeze y's to put into a series, same format as train_test_split
y_train_scaled = pd.DataFrame(scaler.fit_transform(np.array(y_train['score']).reshape(-1, 1)), columns = ['score_scaled']).squeeze(axis = 1)
y_test_scaled = pd.DataFrame(scaler.transform(np.array(y_test['score']).reshape(-1, 1)), columns = ['score_scaled']).squeeze(axis = 1)
Back to Table of Contents¶

8. Supervised Learning Model Selection ¶


Something we haven't discussed yet, beyond the realm of feature engineering, is whether regression or logistic regression makes more sense for this type of problem.

First, note that because we've binned our target variable into multiple classes, this isn't a binary logistic classification problem; how we handle that matters, and the sklearn package handles it a little differently in each model. The methods we'll use for this multi-class (sometimes called multinomial) classification problem are One-vs-Rest, One-vs-One, and multinomial loss (a.k.a. cross-entropy). These methods handle multiple classes fine, but we still need to ask what we're looking for out of this.
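As a minimal sketch (synthetic data, illustrative only) of those three multi-class strategies in sklearn:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier

X, y = make_classification(n_samples=300, n_features=8, n_informative=5,
                           n_classes=4, random_state=0)

ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, y)  # one binary model per class
ovo = OneVsOneClassifier(LogisticRegression(max_iter=1000)).fit(X, y)   # one model per class pair
softmax = LogisticRegression(max_iter=1000).fit(X, y)                   # multinomial / cross-entropy loss

print(len(ovr.estimators_), len(ovo.estimators_))  # 4 fits vs 6 (= 4 choose 2)
```

One-vs-Rest scales linearly in the number of classes, One-vs-One quadratically, while the multinomial formulation fits a single model with one coefficient vector per class.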

Given the predictor variables and expected prediction, are we expecting a:

  • Precise, continuous or discrete, sleep score predictions (Regression)?
    • How do we measure a successful prediction?
      • Is the standard error of each $\hat{y_i}$ within an acceptable range of the ground truth $y_i$?
      • Root Mean Squared Error (RMSE)
      • Mean Absolute Error (MAE)
  • Predicted classes of sleep quality given by binned sleep scores, indicating different levels of 'Bad', 'Good', 'Better', or 'Best' sleep (Classification)?
    • How do we measure a successful prediction?
      • Accuracy
      • F1 Score
      • ROC Curve

It's very important to note that building a convincing metric that relates regression and classification models is almost always a fool's errand. In practice, it's better to formulate the question properly beforehand and fit the models that best answer it, instead of fitting them all and attempting to compare things that can't be directly compared.

All that said, because this project is for exploration, let's go ahead and do just that by creating a few models of each type (regression and classification) and discussing this further in the conclusion section.

8.1. Model Baseline Evaluation Metrics: ¶

An important thing to set before initializing models is an expected baseline to which we can compare performance. At the very least, our models have to improve on these metrics.

8.1.1. Classification Baseline: ¶

A useful baseline for classification is the Zero Rate: a "model" that always predicts the most populous class.

In [59]:
# Calculate the random chance baseline for our classification models.
# 1 / number of classes
random_base_class = 1 / len(bin_map_custom.unique())
print(f"The random chance baseline for classification models is: {random_base_class}")
# Calculate the Zero Rate baseline for our classification models.
zero_rate_base_class = y_train_bin.value_counts().max() / len(y_train_bin)
print(f"The Zero Rate accuracy baseline for classification models that always chooses the most populous class is: {zero_rate_base_class}")
The random chance baseline for classification models is: 0.16666666666666666
The Zero Rate accuracy baseline for classification models that always chooses the most populous class is: 0.2491103202846975

8.1.2. Regression Baseline: ¶

A good baseline for regression is to identify the metrics that will be used to evaluate your regression models, then compute those metrics for a "model" that always predicts the mean or median.

For example, measure the Mean Absolute Error (MAE) if a "model" always predicted values of the median of the target variable (sleep score).

In [60]:
# Calculate the median-prediction MAE baseline for our regression models.
y_pred_baseline = np.full_like(y_train['score'], fill_value = y_train['score'].median(), dtype = float)
mae_baseline_regress = mean_absolute_error(y_train['score'], y_pred_baseline)
print(f"The baseline Mean Absolute Error (MAE) for regression models is: {mae_baseline_regress}")
The baseline Mean Absolute Error (MAE) for regression models is: 6.249110320284697
Back to Table of Contents¶

8.2. Model Helper Functions: ¶


In [61]:
class MidpointNormalize(Normalize):
    def __init__(self, vmin=None, vmax=None, midpoint=None, clip=False):
        self.midpoint = midpoint
        Normalize.__init__(self, vmin, vmax, clip)

    def __call__(self, value, clip=None):
        x, y = [self.vmin, self.midpoint, self.vmax], [0, 0.5, 1]
        return np.ma.masked_array(np.interp(value, x, y))
    
def plotSearchGrid(grid):
    scores = [x for x in grid.cv_results_["mean_test_score"]]
    scores = np.array(scores).reshape(len(grid.param_grid['max_depth']), len(grid.param_grid['max_leaf_nodes']))

    plt.figure(figsize=(10, 8))
    plt.subplots_adjust(left=.2, right=0.95, bottom=0.15, top=0.95)
    plt.imshow(scores, interpolation='nearest', cmap=plt.cm.plasma,
               norm=MidpointNormalize(vmin=0.2, midpoint=0.92))
    plt.xlabel('max_leaf_nodes')
    plt.ylabel('max_depth')
    plt.colorbar()
    plt.xticks(np.arange(len(grid.param_grid['max_leaf_nodes'])), grid.param_grid['max_leaf_nodes'], rotation=90)
    plt.yticks(np.arange(len(grid.param_grid['max_depth'])), grid.param_grid['max_depth'])
    plt.title('Validation Score')
    plt.show()
    
def plotSearchGridSVM(grid):
    scores = [x for x in grid.cv_results_["mean_test_score"]]
    scores = np.array(scores).reshape(len(grid.param_grid['C']), len(grid.param_grid['gamma']))

    plt.figure(figsize=(10, 8))
    plt.subplots_adjust(left=.2, right=0.95, bottom=0.15, top=0.95)
    plt.imshow(scores, interpolation='nearest', cmap=plt.cm.plasma,
               norm=MidpointNormalize(vmin=0.2, midpoint=0.92))
    plt.xlabel('Gamma')
    plt.ylabel('C')
    plt.colorbar()
    plt.xticks(np.arange(len(grid.param_grid['gamma'])), grid.param_grid['gamma'], rotation=90)
    plt.yticks(np.arange(len(grid.param_grid['C'])), grid.param_grid['C'])
    plt.title('Validation Score')
    plt.show()
Back to Table of Contents¶

8.3. Model Initializations: ¶

In [62]:
# Initialize list of fitted model objects.
models_regression = []
models_classification = []

8.3.1. Linear/Logistic Regression: ¶

After standardization, feature removal, and outlier removal, let's see how the metrics associated with the linear regression model have changed.

Note: unlike the previous regression model fit during the EDA section, which was only used for feature exploration, this model has been fit on only the training data split. We can still draw some useful conclusions from it.

In [63]:
mod_linear_OLS = sm.OLS(y_train['score'].reset_index(drop = True), sm.add_constant(X_train_scaled)).fit()
models_regression.append(mod_linear_OLS)
mod_linear_OLS.summary()
Out[63]:
OLS Regression Results
Dep. Variable: score R-squared: 0.882
Model: OLS Adj. R-squared: 0.874
Method: Least Squares F-statistic: 108.7
Date: Fri, 01 Mar 2024 Prob (F-statistic): 3.15e-110
Time: 13:15:24 Log-Likelihood: -696.78
No. Observations: 281 AIC: 1432.
Df Residuals: 262 BIC: 1501.
Df Model: 18
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
const 75.8826 0.178 425.234 0.000 75.531 76.234
deep_sleep_duration 3.6552 0.232 15.764 0.000 3.199 4.112
light_sleep_duration 3.4404 0.252 13.631 0.000 2.943 3.937
rem_sleep_duration 4.0485 0.192 21.048 0.000 3.670 4.427
restless_periods -1.2579 0.247 -5.083 0.000 -1.745 -0.771
awake_time -1.9036 0.210 -9.079 0.000 -2.316 -1.491
average_breath 0.0709 0.213 0.334 0.739 -0.348 0.489
average_hrv 0.3172 0.211 1.500 0.135 -0.099 0.734
latency -2.5018 0.214 -11.704 0.000 -2.923 -2.081
bedtime_start_delta -1.6799 0.224 -7.493 0.000 -2.121 -1.238
nap_today -0.0134 0.194 -0.069 0.945 -0.395 0.368
high_activity_time 0.2556 0.196 1.307 0.192 -0.130 0.641
low_activity_time 0.0520 0.583 0.089 0.929 -1.097 1.201
medium_activity_time 0.5161 0.321 1.610 0.109 -0.115 1.147
non_wear_time 0.3917 0.675 0.580 0.562 -0.938 1.721
resting_time 0.6315 0.526 1.200 0.231 -0.405 1.668
sedentary_time 0.6850 0.809 0.847 0.398 -0.908 2.278
temperature_deviation 0.0768 0.197 0.390 0.697 -0.311 0.465
spo2_percentage 0.1773 0.202 0.878 0.381 -0.221 0.575
Omnibus: 17.170 Durbin-Watson: 2.142
Prob(Omnibus): 0.000 Jarque-Bera (JB): 19.924
Skew: -0.525 Prob(JB): 4.72e-05
Kurtosis: 3.775 Cond. No. 11.4


Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

The expectation here was not to finely tune this model, but to use this model summary to help understand how the data preparation has affected the models.

  • At first glance major improvements have already been made by the feature removal, standardization, and outlier removal.
    • The adjusted R2 score along with the F-statistic p-value indicates a fairly well-fit model.
    • By looking at the t-test p-values we see many more p-values ~ 0.0 than before feature removal.
      • This indicates we're on the right track, but also there are still features that, at least for linear regression, could still be removed or fitted with interaction terms or introduce polynomial terms.
    • The Condition Number is one of the most interesting measures here, confirming how well the feature removal (and possibly standardization) process went. Looking back, it has come down from 1.00e+16 to 11.4.
      • This indicates the features removed eliminated most of the multicollinearity between the remaining features.
    • The Durbin-Watson statistic is 2.142, which matters here because autocorrelation was a concern with this dataset being a time series of sorts. A value so close to 2.0 indicates very little autocorrelation in the data.

As was said, further tuning could be done here to improve this model but it was still beneficial in sanity checking the data preparation process.
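As a quick illustration of why removing collinear features drives the Condition Number down, here is a synthetic sketch (hypothetical data, not the notebook's features):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))  # three independent predictors

# Append a near-duplicate of the first column, mimicking two
# heavily overlapping features (e.g., related activity-time metrics).
X_collinear = np.column_stack([X, X[:, 0] + rng.normal(scale=1e-6, size=100)])

print(f"Condition number, independent columns:   {np.linalg.cond(X):.2f}")
print(f"Condition number, near-duplicate column: {np.linalg.cond(X_collinear):.2e}")
```

The near-duplicate column inflates the condition number by several orders of magnitude, which is the same effect the removed features were having on the original design matrix.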

In [64]:
mod_log_regression = LogisticRegression(solver = 'lbfgs', multi_class = 'multinomial', max_iter = 30000, random_state = rand_state).fit(X_train, y_train_bin)
models_classification.append(mod_log_regression)

mod_lin_regression = LinearRegression().fit(X_train, y_train['score'])
models_regression.append(mod_lin_regression)

mod_ridge_regression = Ridge(max_iter = 10000, random_state = rand_state).fit(X_train, y_train['score'])
models_regression.append(mod_ridge_regression)

mod_ridge_classifier = RidgeClassifier(max_iter = 10000, random_state = rand_state).fit(X_train, y_train_bin)
models_classification.append(mod_ridge_classifier)

mod_lasso_regression = Lasso(max_iter = 10000, random_state = rand_state).fit(X_train, y_train['score'])
models_regression.append(mod_lasso_regression)
/opt/homebrew/lib/python3.11/site-packages/sklearn/linear_model/_logistic.py:460: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of f AND g EVALUATIONS EXCEEDS LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
Back to Table of Contents¶

8.3.2. Support Vector Machine (SVM): ¶

For hyperparameter tuning we will use a grid search with a range of C and Gamma values, and apply 3-fold cross-validation.

In [65]:
params = {'C': np.logspace(6, 10, base = 2), 'gamma': np.logspace(-10, 0, base = 2)}
grid_svm_regression = GridSearchCV(SVR(kernel = 'rbf'), param_grid = params, cv = 3).fit(X_train_scaled, y_train['score'])
mod_svm_regression = grid_svm_regression.best_estimator_
models_regression.append(mod_svm_regression)

grid_svm_classifier = GridSearchCV(SVC(kernel = 'rbf', random_state = rand_state, probability = True), param_grid = params, cv = 3).fit(X_train_scaled, y_train_bin)
mod_svm_classifier = grid_svm_classifier.best_estimator_
models_classification.append(mod_svm_classifier)
In [66]:
plotSearchGridSVM(grid_svm_regression)

scores = [x for x in grid_svm_regression.cv_results_["mean_test_score"]]

max_index = scores.index(max(scores))
max_score = max(scores)
opt_C = grid_svm_regression.cv_results_['param_C'][max_index]
opt_Gamma = grid_svm_regression.cv_results_['param_gamma'][max_index]

print(f'GridSearch results indicate the optimized parameters are, \nC: {opt_C}, Gamma: {opt_Gamma}, Validation R2 Score: {max_score}')

# OR use the attribute .best_params_
grid_svm_regression.best_params_
GridSearch results indicate the optimized parameters are, 
C: 1024.0, Gamma: 0.001124953927697749, Validation R2 Score: 0.8797566276176538
Out[66]:
{'C': 1024.0, 'gamma': 0.001124953927697749}
Back to Table of Contents¶

8.3.3. Decision Trees: ¶

A decision tree with default parameters is a good baseline.

In [67]:
dt_classifier = DecisionTreeClassifier(max_depth = None, max_leaf_nodes = None, random_state = rand_state).fit(X_train, y_train_bin)
models_classification.append(dt_classifier)
print(dt_classifier.score(X_train, y_train_bin))
print(dt_classifier.score(X_test, y_test_bin))
1.0
0.30985915492957744
8.3.3.1 AdaBoost: ¶
In [68]:
adaboost_classifier = AdaBoostClassifier(DecisionTreeClassifier(max_depth = 1),
                   n_estimators = 100,
                   random_state = rand_state,
                   learning_rate = 0.5
                   ).fit(X_train, y_train_bin)
models_classification.append(adaboost_classifier)
print(adaboost_classifier.score(X_test, y_test_bin))
results_adaboost = list(zip(adaboost_classifier.predict(X_test), y_test['score']))
0.23943661971830985
Back to Table of Contents¶
8.3.3.2 Random Forest: ¶
In [69]:
params = {'max_depth': np.linspace(1, 25, 25, dtype = int), 'max_leaf_nodes': np.linspace(2, 98, 25, dtype = int)}
grid_random_forest_regression = GridSearchCV(RandomForestRegressor(n_estimators = 250, random_state=rand_state), 
                    param_grid = params, 
                    cv = 3, 
                    #verbose = 3, 
                    n_jobs = -1
                    ).fit(X_train, y_train['score'])
mod_random_forest_regression = grid_random_forest_regression.best_estimator_
models_regression.append(mod_random_forest_regression)
In [70]:
params = {'max_depth': np.linspace(1, 25, 25, dtype = int), 'max_leaf_nodes': np.linspace(2, 98, 25, dtype = int)}
grid_random_forest_classifier = GridSearchCV(RandomForestClassifier(n_estimators = 250, random_state=rand_state), 
                    param_grid = params, 
                    cv = 3, 
                    #verbose = 3, 
                    n_jobs = -1
                    ).fit(X_train, y_train_bin)
mod_random_forest_classifier = grid_random_forest_classifier.best_estimator_
models_classification.append(mod_random_forest_classifier)
In [71]:
plotSearchGrid(grid_random_forest_regression)
In [72]:
feat_importance = grid_random_forest_regression.best_estimator_.feature_importances_
feat_imp_std = np.std([tree.feature_importances_ for tree in grid_random_forest_regression.best_estimator_.estimators_], axis = 0)
forest_importances = pd.Series(feat_importance, index=bio_sleep_df.columns[4:])

fig, ax = plt.subplots()
forest_importances.plot.bar(yerr = feat_imp_std, ax = ax)
ax.set_title("Feature Importance Regressor")
ax.set_ylabel("Mean Decrease in Impurity")
fig.tight_layout()
In [73]:
plotSearchGrid(grid_random_forest_classifier)
In [74]:
feat_importance = grid_random_forest_classifier.best_estimator_.feature_importances_
feat_imp_std = np.std([tree.feature_importances_ for tree in grid_random_forest_classifier.best_estimator_.estimators_], axis = 0)
forest_importances = pd.Series(feat_importance, index=bio_sleep_df.columns[4:])

fig, ax = plt.subplots()
forest_importances.plot.bar(yerr = feat_imp_std, ax = ax)
ax.set_title("Feature Importance - Classifier")
ax.set_ylabel("Mean Decrease in Impurity")
fig.tight_layout()
Back to Table of Contents¶

9. Results ¶


9.1. Results Helper Functions: ¶

In [75]:
def scores_classification(models):
    """Calculates and returns accuracy score from fitted classifier models.

            Parameters:
                models (list): Iterable of fitted classifier model objects.
            
            Returns:
                Dictionary where model name is the key and its accuracy score is the value. 
                    {'model name' : accuracy}"""
    # Create dictionary to return dataframe of scores.
    model_scores = {}
    print(f'Using target variable bin interval: {bin_type}')
    for i, mod in enumerate(models):
        # Check if model is the SVC(), uses the scaled features.
        if 'SVC' in str(mod):
            X_test_temp = X_test_scaled
        else:
            X_test_temp = X_test

        model_name = str(mod).partition('(')[0]
        accuracy_score = mod.score(X_test_temp, y_test_bin)
        f1score = f1_score(y_test_bin, mod.predict(X_test_temp), average = 'weighted')
        # Set scores to model in dictionary.
        model_scores[model_name] = {'Accuracy': accuracy_score,
                                    'F1_Score(class_weighted)': f1score}

    return model_scores

def ols_scores(model):
    """Calculates and returns regression metrics for statsmodels OLS() fitted models.

    Parameters:
        model (object): Fitted regressor model object.
    
    Returns:
        Nested dictionary where model name is the key and the model evaluation metrics are nested within. 
            {'model name' : {'R2': score,
                            'Adj_R2': score,
                            'RMSE': score,
                            'MAE': score}}"""

    y_pred = model.predict(sm.add_constant(X_test_scaled))
    r2 = model.rsquared
    adj_r2 = model.rsquared_adj
    rmse = mean_squared_error(y_test['score'], y_pred, squared = False)
    mae = mean_absolute_error(y_test['score'], y_pred)

    model_score = {'R2': r2,
                    'Adj_R2': adj_r2,
                    'RMSE': rmse,
                    'MAE': mae}

    return model_score

def scores_regression(models):
    """Calculates and returns R2, Adjusted R2, Root Mean Squared Error (RMSE), and Mean Absolute Error (MAE) from fitted regressor models.

        Parameters:
            models (list): Iterable of fitted regressor model objects.
        
        Returns:
            Nested dictionary where model name is the key and the model evaluation metrics are nested within. 
                {'model name' : {'R2': score,
                                'Adj_R2': score,
                                'RMSE': score,
                                'MAE': score}}"""
    # Create dictionary to return dataframe of scores.
    model_scores = {}

    for mod in models:
        model_name = str(mod).partition('(')[0]

        # Check if model is the SVR(), uses the scaled features.
        if 'SVR' in model_name:
            X_test_temp = X_test_scaled
        elif 'statsmodels' in model_name:
            model_scores['LinearRegression(scaled)'] = ols_scores(mod)
            continue
        else:
            X_test_temp = X_test

        y_pred = mod.predict(X_test_temp)
        r2 = mod.score(X_test_temp, y_test['score'])
        n , p = X_test_temp.shape # number of samples and predictors
        adj_r2 = 1 - (((1 - r2) * (n - 1)) / (n - p - 1))
        rmse = mean_squared_error(y_test['score'], y_pred, squared = False)
        mae = mean_absolute_error(y_test['score'], y_pred)
        # Set scores to model in dictionary.
        model_scores[model_name] = {'R2': r2,
                                    'Adj_R2': adj_r2,
                                    'RMSE': rmse,
                                    'MAE': mae}

    return model_scores

def plot_predictions_vs_score_regression(models):
    """Prints to output (n x 3) subplot array of each model's predicted sleep scores and observed (truth) sleep scores for each X_test, y_test pair, sorted by descending sleep score.
    (where n = number of model objects). Can handle any number of input models.

    Parameters:
        models (list): Iterable of fitted regressor model objects.
    
    Returns:
        None. Prints plot to output."""
    
    fig, ax = plt.subplots(int(np.ceil(len(models) / 3)), 3, figsize = (11, 7))
    for i, mod in enumerate(models):
        # Check if model is the SVR() or OLS(), uses the scaled features.
        if 'SVR' in str(mod):
            X_test_temp = X_test_scaled
        else:
            X_test_temp = X_test
            
        # Check if model is a statsmodel object, if so use their score conventions.
        if 'statsmodels' in str(mod):
            temp_scores = ols_scores(mod)
            y_pred = mod.predict(sm.add_constant(X_test_scaled))
            r2 = temp_scores['R2']
            adj_r2 = temp_scores['Adj_R2']
            rmse = temp_scores['RMSE']
            mae = temp_scores['MAE']
            model_name = 'LinearRegression(scaled)'
        else:
            y_pred = mod.predict(X_test_temp)
            r2 = mod.score(X_test_temp, y_test['score'])
            n , p = X_test_temp.shape # number of samples and predictors
            adj_r2 = 1 - (((1 - r2) * (n - 1)) / (n - p - 1))
            rmse = mean_squared_error(y_test['score'], y_pred, squared = False)
            mae = mean_absolute_error(y_test['score'], y_pred)
            model_name = str(mod).partition('(')[0]

        results = list(zip(y_pred, y_test['score']))
        results = sorted(results, key = lambda x: x[1], reverse = True)
        
        ax[i // 3, i % 3].plot(results, alpha = 0.85, drawstyle = 'steps')
        ax[i // 3, i % 3].set_xlabel('Index (sorted)')
        ax[i // 3, i % 3].set_ylabel('Sleep Score')
        ax[i // 3, i % 3].set_title(model_name)
        ax[i // 3, i % 3].grid(alpha = 0.4)
        ax[i // 3, i % 3].text(x = 0.1, 
                            y = 0.2, 
                            transform = ax[i // 3, i % 3].transAxes,
                            s = (f"{mae:0.3f} - MAE"
                            '\n'f"{rmse:0.3f} - RMSE"
                            '\n'f"{r2:0.3f} - R2"
                            '\n'f"{adj_r2:0.3f} - Adj_R2"))
    # Delete any unused trailing subplots.
    for j in range(len(models), ax.size):
        fig.delaxes(ax[j // 3, j % 3])
        
    plt.subplots_adjust(wspace = 0.2, hspace = 0.4)
    fig.legend(['Predicted', 'Truth'], loc = 'upper right')
    fig.suptitle('Predicted Score and True Score', fontsize = 12)

    return None

def plot_obs_vs_pred_regression(models):
    """Prints to output (n x 3) subplot array of each model's observed vs predicted plots.
    (where n = number of model objects). Can handle any number of input models.

    Parameters:
        models (list): Iterable of fitted regressor model objects.
    
    Returns:
        None. Prints plot to output."""
    
    fig, ax = plt.subplots(int(np.ceil(len(models) / 3)), 3, figsize = (11, 7))
    for i, mod in enumerate(models):
        # Check if model is the SVR(), uses the scaled features.
        if 'SVR' in str(mod):
            X_test_temp = X_test_scaled
        else:
            X_test_temp = X_test

        # Check if model is a statsmodel object, if so use their score conventions.
        if 'statsmodels' in str(mod):
            y_pred = mod.predict(sm.add_constant(X_test_scaled))
            model_name = 'LinearRegression(scaled)'
        else:
            y_pred = mod.predict(X_test_temp)
            model_name = str(mod).partition('(')[0]

        ax[i // 3, i % 3].scatter(y_pred, y_test['score'], alpha = 0.5, s = 15)
        ax[i // 3, i % 3].plot(np.arange(min(y_test['score']), 100, 1), np.arange(min(y_test['score']), 100, 1), 'k--', alpha = 0.6)
        ax[i // 3, i % 3].set_xlabel('Predictions')
        ax[i // 3, i % 3].set_ylabel('Observations')
        ax[i // 3, i % 3].set_title(model_name)
        ax[i // 3, i % 3].grid(alpha = 0.4)

    # Delete any unused trailing subplots.
    for j in range(len(models), ax.size):
        fig.delaxes(ax[j // 3, j % 3])
        
    plt.subplots_adjust(wspace = 0.3, hspace = 0.4)
    fig.suptitle('Observed (Truth) vs Predicted', fontsize = 12)

    return None

def plot_models_roc(models):
    """Prints to output (n x 3) subplot array of each model's ROC curve plot.
    (where n = number of model objects). Can handle any number of input models.

    Parameters:
        models (list): Iterable of fitted classifier model objects.
    
    Returns:
        None. Prints plot to output."""
    # Need to use LabelBinarizer() since we have multiple classes.
    label_binarizer = LabelBinarizer().fit(y_train_bin)
    y_onehot_test = label_binarizer.transform(y_test_bin)
    for mod in models:
        if 'Ridge' in str(mod): # Can't calculate probabilities in Ridge.
            continue
        if 'SVC' in str(mod): # Use scaled X_test values
            y_scores = mod.predict_proba(X_test_scaled)
        else:
            y_scores = mod.predict_proba(X_test)
        fpr, tpr, _ = roc_curve(y_onehot_test.ravel(), y_scores.ravel())
        auc_score = auc(fpr, tpr)
        plt.plot(fpr, tpr, label = f"{str(mod).partition('(')[0]} - AUC = {auc_score:0.2f}")

    plt.plot(np.arange(0, 1.1, 0.1), np.arange(0, 1.1, 0.1), 'k--')
    plt.title('ROC - Micro-Average OvR')
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.grid(alpha = 0.4)
    plt.legend(loc = 'lower right')
    
    return None

# Function to iterate through classification models and plot confusion matrix together.
def plot_models_confusion_matrix(models):
    """Prints to output (n x 3) subplot array of each model's confusion matrix.
    (where n = number of model objects). Can handle any number of input models.

    Parameters:
        models (list): Iterable of fitted classifier model objects.
    
    Returns:
        None. Prints plot to output."""
    fig, ax = plt.subplots(int(np.ceil(len(models) / 3)), 3, figsize = (10, 8), sharey = True)
    for i, mod in enumerate(models):
        # Check if model is the SVC(), uses the scaled features.
        if 'SVC' in str(mod):
            X_test_temp = X_test_scaled 
        else:
            X_test_temp = X_test

        cm = confusion_matrix(y_test_bin, mod.predict(X_test_temp), labels = y_test_bin.sort_values().unique())
        disp = ConfusionMatrixDisplay(confusion_matrix = cm, display_labels = y_test_bin.sort_values().unique())
        disp.plot(ax = ax[i // 3, i % 3], xticks_rotation = 90)
        disp.ax_.set_title(str(mod).partition('(')[0])
        disp.im_.colorbar.remove()

    # Delete any unused trailing subplots.
    for j in range(len(models), ax.size):
        fig.delaxes(ax[j // 3, j % 3])
    fig.tight_layout()
    plt.subplots_adjust(wspace = 0.2, hspace = 0)
    fig.colorbar(disp.im_, ax = ax)
    fig.suptitle('Confusion Matrices of Classifier Models')
    
    return None
Back to Table of Contents¶

9.2. Results - Evaluation Metrics: ¶

9.2.1. Regression Models: ¶

In [76]:
results_regression = scores_regression(models_regression)
display(pd.DataFrame(results_regression).T.sort_values(by = 'MAE'))
R2 Adj_R2 RMSE MAE
SVR 0.866523 0.820319 2.526703 2.041957
Lasso 0.838919 0.783160 2.775705 2.212797
Ridge 0.837268 0.780937 2.789897 2.220879
LinearRegression 0.837241 0.780901 2.790127 2.221564
LinearRegression(scaled) 0.881864 0.873748 2.790127 2.221564
RandomForestRegressor 0.665764 0.550067 3.998320 3.021399

Models have been sorted by Mean Absolute Error (MAE).

  • The Support Vector Machine performed the best across all metrics except Adjusted R2.
  • Comparing the two Linear Regression models, note that their RMSE and MAE are identical, since scaling does not change OLS predictions. The higher R2 for the scaled model comes from ols_scores() reporting statsmodels' in-sample (training) R2, while the other rows show test-set R2.

This looks great: recall that our baseline for regression models was the MAE of always predicting the median sleep score of the training data (see below).

In [77]:
print(f"The baseline Mean Absolute Error (MAE) for regression models is: {mae_baseline_regress}")
The baseline Mean Absolute Error (MAE) for regression models is: 6.249110320284697

The SVM has improved that error by over 4 points. All models improve on this baseline.

In [78]:
plot_predictions_vs_score_regression(models_regression)

This graph and the following one give a good view of how each method handles different ranges of sleep score.

  • This graph shows how similarly the linear, ridge, and lasso models perform on this dataset, though no hyperparameter tuning was performed on Ridge's or Lasso's alpha or tolerance values.
  • Comparing the Mean Absolute Error (MAE), we can visually see how much tighter the SVR's predictions are to the ground truth, reflecting its ~0.2 reduction in MAE.
  • The Random Forest model really struggles with prediction at the ends of the sleep score distribution, but is comparable to the other methods close to the middle of the distribution. This is likely because the majority of the data is concentrated around the median score of 77. More data would likely improve this model (and the others) substantially, especially in the tails.
In [79]:
plot_obs_vs_pred_regression(models_regression)

This graph illustrates the same relationships as the previous one and helps identify where the models perform best throughout the distribution. Here it is also apparent which model performed best, based on which plot concentrates the most points around the 45-degree line.

Back to Table of Contents¶

9.2.2. Classification Models: ¶

In [80]:
results_classification = scores_classification(models_classification)
display(pd.DataFrame(results_classification).T.sort_values(by = 'Accuracy', ascending = False))
Using target variable bin interval: score_bin_custom
Accuracy F1_Score(class_weighted)
SVC 0.633803 0.626915
LogisticRegression 0.605634 0.607407
RandomForestClassifier 0.563380 0.538434
RidgeClassifier 0.436620 0.376509
DecisionTreeClassifier 0.309859 0.304995
AdaBoostClassifier 0.239437 0.174365

The baseline accuracy for classification models was the Zero Rate, i.e., a model that only ever predicts the most frequent class in the training data (see below).

In [81]:
print(f"The Zero Rate baseline for classification models that always chooses the most populous class is: {zero_rate_base_class}")
The Zero Rate baseline for classification models that always chooses the most populous class is: 0.2491103202846975

So our baseline accuracy rate of ~ 0.25 has been beaten by a wide margin, with the SVM achieving an accuracy score of ~ 0.63.

  • AdaBoost is the only model that did not beat this baseline, though it does beat the random chance rate of $\frac{1}{6}$.
In [82]:
plot_models_confusion_matrix(models_classification)
  • The Support Vector Machine (SVC) and Logistic Regression models performed the best. The majority of misclassified test samples in these models fall into classes adjacent to the true class.
  • The Random Forest model also looks to be converging toward a good fit.
  • More data would likely help these models as well.
In [83]:
plot_models_roc(models_classification)

Because we have imbalanced classes, the ROC curve / AUC is the best metric for comparing these classification models. In this case the rankings do align with the accuracy scores, but accuracy alone doesn't show the full story of how the models predict with respect to the False Positive Rate (FPR) and True Positive Rate (TPR).

In [84]:
print('Best Regression Model (by Mean Absolute Error (MAE)):') 
display(pd.DataFrame(results_regression).T.sort_values(by = 'MAE').iloc[0:1, :])
print('-----------------------------------------------------')
print('Best Classification Model (by Accuracy Score):')
display(pd.DataFrame(results_classification).T.sort_values(by = 'Accuracy', ascending = False).iloc[0:1, :])
Best Regression Model (by Mean Absolute Error (MAE)):
R2 Adj_R2 RMSE MAE
SVR 0.866523 0.820319 2.526703 2.041957
-----------------------------------------------------
Best Classification Model (by Accuracy Score):
Accuracy F1_Score(class_weighted)
SVC 0.633803 0.626915
Back to Table of Contents¶

10. Conclusions ¶


The Short Answer:¶

The SVM models consistently outperformed the other models in both the regression and classification tasks, with further gains available through cross-validation and grid-search optimization. However, the dataset's small size (281 training samples, 71 testing samples) limited these cross-validations, and they were not applied to all models, which likely compromised some results. As the dataset grows, the models are expected to show reduced errors. Regression is favored for predicting sleep scores due to its compatibility with the dataset, providing both interpretability and performance. It also seems further tuning of the data could improve the models, judging by the linear regression results.

The SVM regression model scored the best across all metrics except the Adjusted R2, surpassed by the linear regression model with scaled predictor features. The Mean Absolute Error (MAE) baseline score was improved from ~6.2 to ~2.0 indicating a large reduction in error.

The Long Answer and Tangent:¶

Potential Flaws:

As previously discussed, transforming a regression task into a classification one raises questions about the nature of the target variable score as a quantitative measure. It's also important to note the potential flaws in choosing score itself as the target variable. There is an innate fuzziness to the sleep score metric since it comes from unknown methodologies by Oura. It's also debatable whether it is appropriate or accurate to capture sleep quality in discrete values ranging from 0 to 100. Even beyond consumer-grade sleep tracking, despite the complex techniques involved in medical-grade sleep studies, sleep quality is still only measured through correlated metrics and auxiliary biological systems, achieving only a best approximation itself.

That said, having used the Oura Ring for over a year, I see a high degree of correlation between its sleep scores and how I perceived the previous night's sleep. The question to ask here is how much of a difference there is between, say, a score of 81 and 85, and what would be lost changing from discrete values to categorical binning of the target variable.

Evaluation Metrics Chosen:

In evaluating the regression models, the choice between Mean Absolute Error (MAE) and Root Mean Squared Error (RMSE) was important. MAE was ultimately chosen because it provides a clear understanding of average prediction error, aligning well with the target variable score and offering a highly interpretable margin for predictions. This preference was reinforced by the presence of outliers in the small dataset, making MAE a more suitable metric than RMSE.
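A minimal sketch of that outlier sensitivity, using hypothetical predictions (not the notebook's data): with identical typical errors, a single outlier moves RMSE much more than MAE.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

y_true = np.array([76, 78, 80, 82, 84], dtype=float)
y_pred_clean = np.array([78, 80, 82, 84, 86], dtype=float)    # every error = 2
y_pred_outlier = np.array([78, 80, 82, 84, 96], dtype=float)  # one error = 12

for name, y_pred in [('no outlier', y_pred_clean), ('one outlier', y_pred_outlier)]:
    mae = mean_absolute_error(y_true, y_pred)
    rmse = np.sqrt(mean_squared_error(y_true, y_pred))
    print(f"{name}: MAE = {mae:.2f}, RMSE = {rmse:.2f}")
```

One outlier doubles MAE (2.00 to 4.00) but nearly triples RMSE (2.00 to ~5.66), because RMSE squares each error before averaging.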

In evaluating the classification models, since the data split contained class imbalances, the weighted F1 score and AUC were the preferred evaluation metrics.

Regression vs Classification and Interpreting the Results:

A consideration to make is the significance of the difference between specific scores, such as 81 and 85, and the potential loss of information in transitioning the target variable from discrete values to categorical bins. During data preparation we saw that sleep score values spanned 0-100, with the majority of the distribution around 60-90, and were binned into ordinal categorical intervals for classification purposes. The chosen binning intervals were as follows:

Binning Intervals Used in Training:

[0.000, 64.395, 72.798, 77.000, 81.202, 89.605, 100.000]

['Worst', 'Bad', 'Below Avg', 'Above Avg', 'Good', 'Best']
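To make the mapping concrete, here is a sketch of how these edges assign individual scores to bins using pandas' pd.cut (the example scores are hypothetical, and the notebook's own binning step in section 3.5.4.3 may use slightly different options):

```python
import pandas as pd

# Custom quantile bin edges and labels from above.
bin_edges = [0.000, 64.395, 72.798, 77.000, 81.202, 89.605, 100.000]
bin_labels = ['Worst', 'Bad', 'Below Avg', 'Above Avg', 'Good', 'Best']

scores = pd.Series([60, 76, 81, 85, 95])  # hypothetical sleep scores
score_bins = pd.cut(scores, bins=bin_edges, labels=bin_labels, include_lowest=True)
print(list(score_bins))
# -> ['Worst', 'Below Avg', 'Above Avg', 'Good', 'Best']
```

Note that 81 and 85 land in different bins ('Above Avg' vs 'Good'), while a similar gap such as 77.1 to 81.2 stays within a single bin, which is exactly the information loss discussed below.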

This transformation introduces several biases into the predictions:

  • Information loss occurred based on the chosen bin size and interval edge placement.
  • Quantitative relationships between the predictors and the now-categorical target variable are lost.
  • While the binned target variable is simple to interpret ('Bad' or 'Above Avg'), accuracy in predictions becomes more important. A more specific answer with easy-to-interpret errors would be better.

Based on the results above we can see that regression gives us more interpretable and better results. Why is this true?

  1. Interpretability

    • With an MAE of ~ 2.04, the regression model predicted scores within that average deviation; a predicted score of 76 would likely fall within the range of 76 $\pm$ 2.
    • In comparison, the classification model, with an accuracy score of ~ 63.4%, regularly misclassified a true score of 76 to 'Above Avg' or 'Bad', and occasionally to 'Worst', resulting in a much wider spread of predicted scores.
  2. Target Variable Alignment

    • A quantitative target variable aligns naturally with the regression space.
      • As evidenced in the observed vs predicted plot, the predictions concentrate tightly around the ground truth.
    • By binning the target variable into categories, we assume there is negligible information in differences of the target variable smaller than the chosen bin width. If this is not the case, information is smoothed at best, or lost altogether.

Future Improvements

  • Utilize an Ordinal Logistic Regression model to better preserve the ordinal quality of the target variable in the classification space.
Back to Table of Contents¶


In [85]:
# Exported via command line using:

# # HTML
# jupyter nbconvert Super_Sleep_Notebook.ipynb --to html
# jupyter nbconvert Super_Sleep_Notebook.ipynb --to html --HTMLExporter.theme=dark

# # PDF
# jupyter nbconvert Super_Sleep_Notebook.ipynb --to webpdf
# jupyter nbconvert Super_Sleep_Notebook.ipynb --to webpdf --HTMLExporter.theme=dark